PRESCHOOL RACIAL ATTITUDE MEASURE II


JOHN E. WILLIAMS, DEBORAH L. BEST, DONNA A. BOSWELL,
LINDA A. MATTSON, AND DEBORAH J. GRAVES


Wake Forest University 


The earlier version of the Preschool Racial Attitude Measure 
(PRAM I) has been found to be a useful measure in attitude 
development and modification studies of young children. This 
paper describes the lengthened and otherwise revised version of 
this procedure—PRAM II. 

Standardization data are reported for 252 Caucasian and 140
Negro children, ranging in age from 37 to 85 months (mean =
64 months), who were tested by Caucasian and Negro examiners.
Analyses of the racial attitude scores revealed that the measure
had good internal consistency (r = .80) and satisfactory
test-retest reliability (r = .55, over a one-year interval). It was
demonstrated that the test may be divided into two equivalent short
forms for test-retest purposes. The racial attitude scores were
found to vary systematically with race of subject, but not with sex
of subject, IQ, or age. Evidence regarding race of examiner effects
was inconclusive.

It was concluded that PRAM II provides a reliable index of 
racial attitudes, and that the same rationale could be employed in
the assessment of other attitudes at the preschool level. Theories 
of racial attitude development are discussed. 


Williams and Roberson (1967) described a method for the


* This study was supported by a grant to the first author from the National
Institute of Child Health and Human Development (HD-02821). The authors 
are indebted to the administrators and teachers of the participating schools 
for their cooperation. The authors are grateful to Beth Norbrey, Kathleen 
Williams, Elaine Wright, Shirley Colquiett, and Shari Fulmer for their assist- 
ance as examiners. 

* Requests for reprints of this paper, for copies of the PRAM II manual 
and technical report, and for information concerning the loan or purchase of 
test materials should be sent to John E. Williams, Department of Psychology, 
Wake Forest University, Winston-Salem, N.C. 27109. 
assessment of racial attitudes in preschool children. In this procedure,
subsequently named the Preschool Racial Attitude Measure I (PRAM I),
a series of pictures was employed, each of which contained two human
figures, one with pinkish-tan skin and blonde hair (“Caucasian”), the
other with medium-brown skin and black
hair (“Negro”). Each picture was accompanied by a story con- 
taining one of six positive evaluative adjectives or one of six 
negative evaluative adjectives, with the child being asked which 
of the two persons was the one described in the story. The child, 
thus, had twelve opportunities to select one of the two figures in 
response to the adjectives. With one point given for selecting the 
Caucasian figure in response to a positive adjective, and one point 
for selecting the Negro figure in response to a negative adjective, 
the score range was zero to twelve, with low scores indicating a 
pro-Negro/anti-Caucasian bias, high scores indicating pro-Cau- 
casian/anti-Negro bias, and scores around six indicating no bias. 

The foregoing PRAM I procedure has been employed in a number 
of investigations, several of which are summarized in Table 1. In 
the table, it can be seen that the investigations have been conducted
in a number of different locales, with groups of children differing in
age, race, and social class. It will be noted that the mean racial
attitude (RA) scores in all groups fell in the upper portion of the
score range, indicating a tendency toward pro-Caucasian/anti-Negro
attitudes in all groups. The data indicated, however, that this
tendency was stronger among Caucasian children than among Negro
children. In addition, Vocke’s (1971) data suggest that the race
of the examiner may have had a slight effect upon the scores obtained.

Several investigators have employed the PRAM I procedure to 
assess the outcome of studies designed to modify racial attitudes 
in preschool children. The general findings from these studies in- 
dicated that these attitudes were: easily changed by direct behavior 
modification procedures (Edwards and Williams, 1970; McMurtry 
and Williams, 1972); changed, but less dramatically, by modifying
the children’s affective responses to the colors white and black 
(Williams and Edwards, 1969; McAdoo, J. L., 1970); and un- 
changed by special curriculum procedures (McAdoo, J. L., 1970; 
Walker, 1971). 

The original rationale of the PRAM I procedure was derived 
from that of the semantic differential and was based on the as- 
sumption that the “semantic space” of the preschool child embraces 
an evaluative dimension, similar to that previously demonstrated 


TABLE 1
Mean Racial Attitude (RA) Scores in Previous Studies Employing the PRAM I Procedure

[Table body illegible in source.]

6 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


among older children and adults by Osgood, Suci, and Tannenbaum 
(1957), DiVesta (1966), and others. This evaluative dimension 
is conceptualized as being defined at one extreme by a group of 
non-synonymous words which share a common meaning response 
of positive evaluation, or “goodness”; and at the other extreme by 
a second group of non-synonymous words which share a common 
meaning response of negative evaluation or “badness.” In a series 
of recent studies, it has been demonstrated that the assumption of 
such an evaluation dimension at the preschool level is tenable 
(Edwards and Williams, 1970; McMurtry and Williams, 1972; 
Gordon and Williams, 1973). This makes it feasible to employ the 
evaluative dimension to develop preschool attitude measures which
can be coordinated with traditional adult attitude measures via
the rationale of the semantic differential.

The purpose of the present paper was to describe a revised version 
of the Preschool Racial Attitude Measure, designated PRAM II, 
and to present initial standardization data for the new procedure. 
Among other changes, the revision involved a redrawing of the 
stimulus figures, a doubling of the length of the procedure, the 
study of possible race of subject and race of examiner effects, and 
a careful examination of psychometric characteristics. 


Method* 


Materials 


In the PRAM II revision, several changes were made in the 
test materials. A new set of twenty-four 8 x 10 racial attitude 
pictures was drawn in order to improve the general artistic quality 
of the stimulus materials. The skin-colors of the two figures in each 
picture were the same as in PRAM I but the difference in hair 
color was removed by drawing both figures with black hair. The 
two PRAM II figures in each picture were, thus, identical except 
for the skin-color difference—pinkish-tan vs. medium brown. In the 
series of 24 pictures, figures of both sexes were employed, and a
variety of ages (from young children to "grandparents") were
represented. The figures in the series were drawn in a variety of 
sitting, standing, and walking positions, with the pictures being 
otherwise generally ambiguous as to any activities in which the 
persons represented might be engaged. 


* The PRAM II materials and procedures are described more fully in the
PRAM II manual (Williams, 1971a), available upon request.
The lists of six positive and six negative adjectives employed in
PRAM I were each doubled by the addition of six more adjectives.
The twelve positive adjectives used in PRAM II were: clean, good,
kind, nice, pretty, smart; and friendly, happy, healthy, helpful,
right, and wonderful. The twelve negative adjectives used were:
bad, dirty, mean, naughty, ugly, stupid; and cruel, sad, selfish, sick,
unfriendly, and wrong. In both adjective groups, the first six
adjectives were the “old” adjectives used in PRAM I, while the second
six were the “new” adjectives added in the PRAM II revision. In
PRAM II, the old and new adjectives were equally distributed 
between the first half of the test (called Series A), and the second 
half of the test (Series B). 

PRAM I contained, in addition to the racial attitude items, a 
series of twelve sex-role items which assessed the child’s knowledge 
of typical sex-stereotyped behaviors, and which provided a control 
measure of general conceptual development. These same items (see 
Williams and Roberson, 1967) were incorporated into the PRAM II 
procedure. For these items, a new series of twelve 8 x 10 sex-role 
pictures was drawn, each of which displayed a male and female 
figure of the same general age, and of the same race (half of the 
pictures represented Caucasians; half, Negroes). 

In summary, the materials for the total PRAM II procedure 
consisted of 36 pictures, 24 of which were used for racial attitude 
items and 12 for sex-role items. In the standard administration of
the procedure, the first item was a sex-role item, followed by two 
racial attitude items, with this pattern repeated throughout the test. 


Subjects 


The basic standardization group for PRAM II consisted of 272 
preschool children from Winston-Salem, North Carolina.* The 
children ranged in age from 37 months to 85 months, with a mean 
age of 64.9 months (S.D. = 7.64). Half of the children were 
Caucasian, and half were Negro, with each race group composed of 
equal numbers of males and females. As described below, half 
of each race-sex group were tested by Caucasian examiners, and 
half by Negro examiners. A three dimensional analysis of variance 
(race of subject × sex × race of examiner) indicated that the
mean chronological age in all groups was equivalent. The principal 
data analyses reported below were based on this group of subjects. 
* This group was a further expansion of the 1970-71 standardization group
(N = 232) described in the PRAM II Technical Report #1. The additional
subjects served to correct the sex imbalance in the 1970-71 group.
Data were also available for a supplemental group of 116 Caucasian
and 4 Negro preschoolers with a mean chronological age of
62.0 months, tested primarily by Caucasian examiners. For certain analyses,
these subjects were combined with the 272 subjects from the stan- 
dardization group to produce a combined group of 392 subjects 
with a mean age of 64.0 months, 252 of whom were Caucasian 
(mean age = 64.2) and 140 of whom were Negro (mean age = 
63.6), with both race groups composed of equal numbers of males 
and females. 

Twenty-nine Caucasian and 28 Negro subjects from the stan- 
dardization group were retested after a one year interval by an 
examiner of the same race as the one who had done the original 
testing. The mean chronological age of this retest group was 56.8 
months at the time of the first testing, and 69.6 months at the 
time of the second. 


Procedure 


All examiners were females in their early twenties. Examiners 
for the standardization study were one Caucasian graduate student, 
and four undergraduate students, two of whom were Caucasian 
and two Negro. Two additional undergraduate students, one Caucasian
and one Negro, served as examiners in the retest study. The
data for the supplementary group were gathered by the above
persons with the assistance of two additional undergraduate examiners,
one Negro and one Caucasian. All examiners were trained
in administering the PRAM II and Peabody Picture Vocabulary 
Test (PPVT) procedures. Particular care was taken in training 
examiners not to provide incidental feedback to the children during 
the PRAM II administration. 

The standard procedure for the administration of PRAM II 
was as follows. The child was taken from his classroom to a private 
room where he and the examiner were seated at a low table. After 
some initial conversation to build rapport, the examiner placed the 
PRAM II picture notebook and answer sheet on the table and said: 

“What I have here are some pictures I’d like to show you, and
stories to go with each one. I want you to help me by pointing
to the person in each picture that the story is about. Here, I’ll
show you what I mean.” The examiner then opened the notebook
to the first (sex-role) picture of a little boy and a little girl seated,
and read the first story: “Here are two children. One of these
children has four dolls with which they like to have tea parties.
Which child likes to play with dolls?” After recording the child’s
response, the examiner displayed the second picture of two little
boys, one Caucasian and one Negro, walking, and read the second
story: “Here are two little boys. One of them is a kind little boy.
Once he saw a kitten fall into a lake and he picked the kitten up
to save it from drowning. Which is the kind little boy?” After
recording the child’s response, the examiner proceeded to picture
three and story three, etc., until all 36 items (12 sex-role; 24 racial
attitude) had been presented. In the standardization study, half
the subjects were administered the second half of the test first,
in order to study the equivalence of Series A (Items 1-18) and
Series B (Items 19-36) of the procedure.

In the standardization group, the PRAM II administration was
followed by the administration of the Peabody Picture Vocabulary 
Test (PPVT), Form B, following standard directions (Dunn, 1965). 

The PRAM II racial attitude responses and sex role responses 
were scored in the following manner. The racial attitude score was 
determined by counting one point for the selection of the light- 
skinned figure in response to a positive adjective, and one point 
for the selection of a dark-skinned figure in response to a negative 
adjective. The racial attitude total score based on all 24 items thus 
had a range of 0-24, with high scores indicating a pro-Caucasian/ 
anti-Negro bias, low scores indicating a pro-Negro/anti-Caucasian 
bias, and mid-range scores (around 12) indicating no bias. In 
addition to this total score, several pairs of subscores were deter- 
mined for each subject. Each pair of subscores was based on a 
division of the subject’s responses into two halves, and each subscore 
thus had a range of 0-12: (1) a first-half (Series A) score and
a second-half (Series B) score; (2) an odd-numbered items score and
an even-numbered items score; (3) an old items score and a new items
score; and (4) a positive adjective score and a negative adjective score.
The 12 sex-role items were scored by giving one point for each
sex-appropriate response, yielding a possible score range of 0-12.
The PPVT was scored in standard fashion to yield an IQ score 
for each subject. 
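
The scoring rule just described can be illustrated with a minimal sketch. The record layout (`Response`), the field names, and the function names are hypothetical and are not taken from the PRAM II manual. The sketch numbers the racial attitude items 1-24 in order of presentation and assumes that the odd-even split refers to that numbering. Because Items 1-18 of the full test contain twelve racial attitude items, Series A is taken here as racial attitude items 1-12.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Response:
    """One racial attitude item answered by one child (hypothetical layout)."""
    item: int          # position among the 24 racial attitude items, 1-24
    positive: bool     # True if the story used a positive adjective
    old: bool          # True if the adjective was carried over from PRAM I
    chose_light: bool  # True if the child chose the light-skinned figure


def point(r: Response) -> int:
    """One point for light figure + positive adjective,
    or dark figure + negative adjective; otherwise zero."""
    return 1 if r.chose_light == r.positive else 0


def score(responses: Iterable[Response]) -> Dict[str, int]:
    """Return the total score (0-24) and four pairs of 12-item subscores."""
    rs = list(responses)
    if sorted(r.item for r in rs) != list(range(1, 25)):
        raise ValueError("expected exactly one response for each of items 1-24")

    def subtotal(keep: Callable[[Response], bool]) -> int:
        return sum(point(r) for r in rs if keep(r))

    return {
        "total": subtotal(lambda r: True),
        "series_a": subtotal(lambda r: r.item <= 12),
        "series_b": subtotal(lambda r: r.item > 12),
        "odd": subtotal(lambda r: r.item % 2 == 1),
        "even": subtotal(lambda r: r.item % 2 == 0),
        "old": subtotal(lambda r: r.old),
        "new": subtotal(lambda r: not r.old),
        "positive": subtotal(lambda r: r.positive),
        "negative": subtotal(lambda r: not r.positive),
    }
```

Under this rule a child who always chooses the light-skinned figure earns all twelve positive-adjective points and none of the negative-adjective points, for a total of 12, which is the mid-range value the text treats as indicating no bias.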


Results*


The total racial attitudes scores (RA-T) for the 272 subjects in 
the standardization group were analyzed by race of subject, race
of examiner, and sex of subject. The three-dimensional analysis of 


* Additional detailed results are available in the PRAM II Technical Report
#1 (Williams, 1971b), available upon request.
variance performed to assess the effects of these variables revealed
significant (p < .01) main effects of race of subject and race of
examiner, and a nonsignificant main effect of sex. In addition, no 
interactions were significant. The nature of the two significant main 
effects can be seen in Table 2 which presents the mean RA-T 
scores in each of the four race of subject/race of examiner
groups. The race of subject effect is seen in the fact that Caucasian 
children obtained higher scores than Negro children under both 
race of examiner conditions, with an overall Caucasian subject 
mean of 17.02 and an overall Negro subject mean of 14.73. The 
significant race of examiner effect indicates that both groups of 
children obtained higher scores with Caucasian examiners than 
with Negro examiners, with an overall Caucasian examiner mean 
of 16.97, and an overall Negro examiner mean of 14.78. The sug- 
gestion in Table 2 that the race of examiner effect is greater for 
Caucasian children than for Negro children was not supported 
by the statistical analysis, since the race of the subject by race 
of examiner interaction effect was not statistically significant. 
Data from the 57 children in the retest group were used to re- 
check the two main effects just described. As noted earlier, these 
children were retested after a one year interval by different ex- 
aminers of the same race as their initial examiners. An analysis 
of the RA-T scores in this group again revealed a significant (p <
.01) race of subject effect: Caucasian subject X̄ = 18.8; Negro
subject X̄ = 15.0. When the data were examined by race of examiner,
no significant difference was found. An additional source of negative
evidence regarding race of examiner effects is found in a study by 
Best (1972). In this study, each of 60 preschool Caucasian children 
was administered PRAM II by two examiners. The first examiner 
gave standard instructions and administered the first half of PRAM 


TABLE 2
Mean Total Racial Attitude Scores for 272 Subjects Classified by Race of Subject
and Race of Examiner

                                  Race of Examiner
Race of Subject       Caucasian       Negro           Total
Caucasian             18.66           15.38           17.02
                      (N = 68)        (N = 68)        (N = 136)
Negro                 15.28           14.18           14.73
                      (N = 68)        (N = 68)        (N = 136)
Total                 16.97           14.78           15.88
                      (N = 136)       (N = 136)       (N = 272)

II. At this point, a second examiner entered the room, replaced the
first examiner, and administered the second half of the procedure. 
One quarter of the subjects were tested by each of the following 
race of examiner combinations: two Caucasians; two Negroes; Cau- 
casian then Negro; and Negro then Caucasian. The analyses of 
the data obtained under these conditions provided no evidence of 
race of examiner effects. For example, the mean RA-T score ob- 
tained by the two Caucasian examiners was 17.6, compared with 
17.9 for the two Negro examiners. In addition, there was no evi- 
dence of a tendency to shift scores up or down when the race of 
examiner was changed. In sum, the race of examiner effect found 
in the standardization group was not replicated in either the retest 
group, or in the Best study. In view of this, it appears that, for the 
present, the evidence regarding race of examiner effects at the 
preschool level remains inconclusive. 

Returning to the race of subject effect, it was noted above that 
this effect was replicated in the retest study. These findings were
also congruent with the findings from the PRAM I studies sum- 
marized earlier in Table 1. Thus, it was concluded that Negro 
children as a group make lower racial attitude scores than do Cau- 
casian children. We will now examine the distribution of scores 
in each of these groups more closely by employing the data from 
the combined group and ignoring the question of race of examiner 
effects. 

Due to the two-choice nature of the PRAM procedure, the 
binomial distribution provided a convenient way to determine when 
an individual child was responding in a manner which would be 
unlikely on a chance basis. With 24 response opportunities, the 
probability of an unbiased child obtaining a score of 17 or up was 
only .035; the same probability existed for scores of 7 or down. 
Thus, scores in the former category (17 up) were taken as evidence
of a “definite” pro-Caucasian/anti-Negro bias (C+/N−), while
scores in the latter category (7 down) reflected a “definite”
pro-Negro/anti-Caucasian bias (N+/C−). Likewise, scores of 15 and
16, 8 and 9, were taken as evidence of “probable” bias, while the
10-14 mid-range was characterized as “unbiased.”
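
The binomial reasoning behind these cut-offs can be checked with a short sketch. The function names are illustrative only, and the computation assumes an unbiased child chooses each figure with probability .5 on each of the 24 items. Under that assumption the exact tail probability for a score of 17 or higher is about .032, close to, though not identical with, the value quoted above. The difference may reflect rounding or an approximation in the original computation.

```python
from math import comb

N_ITEMS = 24  # racial attitude items in PRAM II


def binom_pmf(k: int, n: int = N_ITEMS, p: float = 0.5) -> float:
    """Probability of exactly k points in n two-choice items."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def tail_at_or_above(score: int, n: int = N_ITEMS) -> float:
    """Chance probability of obtaining `score` or more points."""
    return sum(binom_pmf(k, n) for k in range(score, n + 1))


def tail_at_or_below(score: int, n: int = N_ITEMS) -> float:
    """Chance probability of obtaining `score` or fewer points."""
    return sum(binom_pmf(k, n) for k in range(0, score + 1))


def classify(score: int) -> str:
    """Map an RA-T score (0-24) onto the five categories used in the text."""
    if not 0 <= score <= N_ITEMS:
        raise ValueError("RA-T scores range from 0 to 24")
    if score >= 17:
        return "definite C+/N-"
    if score >= 15:
        return "probable C+/N-"
    if score >= 10:
        return "unbiased"
    if score >= 8:
        return "probable N+/C-"
    return "definite N+/C-"


if __name__ == "__main__":
    print(round(tail_at_or_above(17), 3))  # about 0.032
    print(classify(16))                    # probable C+/N-
```

Because the chance distribution is symmetric when p = .5, `tail_at_or_below(7)` returns the same value as `tail_at_or_above(17)`.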

In Table 3 are presented the percentages of the 252 Caucasian 
children and 140 Negro children from the combined group who 
obtained scores in each of the foregoing classes. When the observed 
frequencies in each class were compared with the chance frequencies
from the binomial distribution, a significant (p < .001)
chi square was obtained for each group. There was a significant 
TABLE 3

Percentage of Total Racial Attitude (RA-T) Scores of Preschool Children Falling
Into Each of Five Categories

RA-T Score                              Caucasian     Negro
Range        Category                   Subjects      Subjects      Chance
                                        (N = 252)     (N = 140)     Expectancy
17-24        Definite C+/N− Bias        60.3%         39.3%          3.3%
15-16        Probable C+/N− Bias        11.5%         12.9%         12.1%
10-14        Non-Biased                 23.4%         32.1%         69.2%
8-9          Probable N+/C− Bias         2.8%          5.0%         12.1%
0-7          Definite N+/C− Bias         2.0%         10.7%          3.3%


tendency toward high (pro-Caucasian/anti-Negro) scores among 
both Caucasian and Negro children. Perhaps the most dramatic 
evidence seen in Table 3 was the high degree of definite C+/N−
bias (17 up) in both subject groups, being found in approximately
6 out of 10 Caucasian children and 4 out of 10 Negro children. At
the other extreme, evidence of definite N+/C− bias (7 down) was
found in only 1 out of 10 Negro children and in only 1 out of 50 
Caucasian children. 
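
The comparison of observed with chance frequencies can be sketched as a chi-square goodness-of-fit computation. This is an illustration under stated assumptions, not a reproduction of the original analysis. The observed counts below are reconstructed by multiplying the reported percentages by the group sizes and rounding, the chance proportions are exact binomial values rather than the rounded chance expectancies in the table, and the critical value 18.467 is the standard tabled chi-square for 4 degrees of freedom at p = .001.

```python
from math import comb

N_ITEMS = 24

# Score ranges for the five categories, highest first.
CATEGORIES = {
    "definite C+/N-": range(17, 25),
    "probable C+/N-": range(15, 17),
    "unbiased": range(10, 15),
    "probable N+/C-": range(8, 10),
    "definite N+/C-": range(0, 8),
}

# Counts reconstructed from the reported percentages (an assumption).
CAUCASIAN_OBSERVED = [152, 29, 59, 7, 5]   # sums to 252
NEGRO_OBSERVED = [55, 18, 45, 7, 15]       # sums to 140

CRITICAL_001_DF4 = 18.467  # tabled chi-square, df = 4, p = .001


def chance_proportions() -> list:
    """Exact binomial probability of falling in each category when p = .5."""
    total = 2 ** N_ITEMS
    return [sum(comb(N_ITEMS, k) for k in ks) / total
            for ks in CATEGORIES.values()]


def chi_square(observed: list) -> float:
    """Goodness-of-fit statistic against the chance proportions."""
    n = sum(observed)
    expected = [n * p for p in chance_proportions()]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


if __name__ == "__main__":
    for label, counts in (("Caucasian", CAUCASIAN_OBSERVED),
                          ("Negro", NEGRO_OBSERVED)):
        stat = chi_square(counts)
        print(label, round(stat, 1), stat > CRITICAL_001_DF4)
```

With these reconstructed counts both statistics exceed the critical value by a wide margin, which is consistent with the significance level reported in the text.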


Relationship of Racial Attitude to Other Subject Variables 


The lack of relationship of RA-T scores to sex of subject was 
demonstrated by the nonsignificant sex effect in the analysis of 
variance described above. The possible relationships between RA-T 
scores and other subject variables were explored by means of 
product-moment correlation coefficients computed among the vari- 
ables of RA-T scores, age, PPVT-IQ scores, and Sex Role scores. 
These correlations are summarized in Table 4 where it can be seen 
that the RA-T scores were independent of both chronological age 
and PPVT-IQ. On the other hand, the Sex Role score, with which
RA-T shows only a slight positive relationship, is significantly as- 
sociated with both chronological age and PPVT-IQ. These findings 
appear to have important theoretical implications regarding the 
origin of racial attitudes which will be discussed later in the paper. 


Internal Consistency 


The internal consistency of the 24-item RA-T scale was examined 
using data from the 392 subjects in the combined group. This was 
done by comparing the subjects’ responses to two halves of the 
scale, with the data divided in several different ways: odd items 
vs. even items; “old” adjectives vs. “new” adjectives; positive
adjectives vs. negative adjectives; and first half (Series A) vs. second
half (Series B). In each of these comparisons a product moment 
correlation coefficient was computed between scores on the two 
12-item halves, and the Spearman-Brown correction for doubled 
length was then employed to estimate the internal consistency of 
the total 24-item scale. 
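
The split-half procedure described above can be sketched as follows. The function names are illustrative, and the Spearman-Brown formula is given in its general form, with the doubled-length case (factor = 2) as the default: r_SB = 2r / (1 + r).

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two sets of half-test scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences of at least 2 scores")
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float, factor: float = 2.0) -> float:
    """Estimated reliability of a test `factor` times as long as each half."""
    return factor * r_half / (1 + (factor - 1) * r_half)


if __name__ == "__main__":
    # A half-test correlation of .71 (the Series A vs. Series B value
    # reported in the text) yields an estimate of about .83.
    print(round(spearman_brown(0.71), 2))
```

The correction always raises a positive half-test correlation, since the full 24-item scale is twice as long as either 12-item half.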

The results of these comparisons are summarized in Table 5. 
These findings indicated that the racial attitude scale possessed a 
high degree of homogeneity. The Spearman-Brown estimates for the 
usual “split-half” comparisons (odd-even; first half-second half) 
indicated that the internal consistency “reliability” of RA-T scores 
was of the order of .80. 

The findings for the old-item vs. new-item comparisons provided 
satisfactory evidence that the twelve items added in the PRAM II 
revision were measuring the same thing as the old items from 
PRAM I. The results for the positive adjective vs. negative adjective
comparison indicated that scores on these two sub-scales
were substantially related, indicating that children who chose light- 
skinned figures in response to positive adjectives also tended to 
choose dark-skinned figures in response to negative adjectives, and 
vice-versa. Thus, the positive and negative items appeared to be 
assessing different aspects of the same trait. 

The results of the Series A vs. Series B comparison were of par- 
ticular interest since the series had been designed to provide alter- 
nate short forms of the procedure. The high correlation between A 
and B scores (.71), the virtually identical means (A = 8.20; B = 


TABLE 4

Intercorrelations among Total Racial Attitude Scores, Age, PPVT-IQ, and Sex Role
Scores for 252 Caucasian and 140 Negro Preschoolers

                    Subject      Chronological    PPVT     Sex Role
                    Groups       Age              IQ       Score
Racial Attitude     Cauc. Ss     .11              .00      .25*
Scores              Negro Ss     .03              .06      .02
                    All Ss       .09              .15      .20*
Chronological       Cauc. Ss                      .04      .41*
Age                 Negro Ss                      .05      .38*
                    All Ss                        .10      .38*
PPVT-IQ             Cauc. Ss                               [illegible]
                    Negro Ss                               [illegible]
                    All Ss                                 .33

Note. Correlations involving PPVT-IQ are based on Caucasian N = 136, Negro N = 132,
Total N = 268.
* p < .01.
TABLE 5

Internal Consistency Measures for Racial Attitude Scale: Means, Correlation
Coefficients (r), and Spearman-Brown Estimates (r SB)

Odd-Numbered Items vs. Even-Numbered Items
                        X̄ Odd         X̄ Even        r      r SB
Cauc. Ss (N = 252)      8.86          8.55          .61    .76
Negro Ss (N = 140)      7.38          7.31          .76    .86
Total Ss (N = 392)      8.33          8.11          .69    .81

Old Items vs. New Items
                        X̄ Old         X̄ New         r      r SB
Cauc. Ss                9.23          8.18          .70    .78
Negro Ss                7.64          7.04          .68    .81
Total Ss                8.67          7.77          .70    .82

Positive Items vs. Negative Items
                        X̄ Positive    X̄ Negative    r      r SB
Cauc. Ss                8.91          8.50          .53    .69
Negro Ss                7.56          7.11          .68    .81
Total Ss                8.45          8.00          .61    .76

First Half (Series A) Items vs. Second Half (Series B) Items
                        X̄ Series A    X̄ Series B    r      r SB
Cauc. Ss                8.62          8.79          .65    .79
Negro Ss                7.45          7.24          .75    .86
Total Ss                8.20          8.24          .71    .83


8.24) and standard deviations (A = 2.74; B = 2.79), indicated
that the two scales could be considered as equivalent 12-item short
forms of PRAM II.


Stability 


The stability of the RA-T scores across a one-year interval (12.8
months) was assessed using the 57 subjects from the retest group. 
At the time of first testing, these subjects had a mean age of 56.8 
months; at the second testing, these subjects had a mean age of 69.6 
months. The mean RA-T score at the first testing was 15.30 while 
the mean at the second testing was 16.93, a statistically significant 
(p < .05) increase of 1.63 points. This finding suggests that there 
may have been a slight positive practice effect for the RA-T scores. 
It does not seem likely that the increase was attributable to the
fact that the children were a year older, since RA-T scores were 
found not to be correlated with age, as noted above. 

Three scores (Series A, Series B, and Total) from the first ad- 
ministration were each correlated with the same three scores from 
the second administration, with statistically significant coefficients 
obtained in all instances. The correlation of .55 between total scores 
at the two administrations provided the best available estimate of
test-retest reliability, although the relative youth of the subjects 
suggested that this may be a minimum estimate. As is usual, this 
value was lower than the estimated internal consistency of .80, 
noted above. 

Discussion

The results of the studies just described have led to the con- 
clusion that the PRAM II revision was generally successful and 
that PRAM II can be viewed as a lengthened and otherwise improved
version of PRAM I which provides a psychometrically sound
method for the assessment of racial bias in pre-literate
children. While the results of the studies regarding possible race
of examiner effects were inconclusive, the evidence regarding race 
of subject effects was clear and consistent: the racial attitude 
scores of Negro preschoolers averaged two to three points lower 
than the scores of Caucasian preschoolers. This evidence of dif- 
ference in mean scores should not obscure the fact that in both 
groups, pro-Caucasian/anti-Negro (C+/N−) bias was much more
evident than was the reverse pro-Negro/anti-Caucasian (N+/C−)
bias. 

Evidence such as the foregoing is often interpreted in accord with
the "normative" theory of prejudice which would attribute the
C+/N− bias among preschoolers to their having acquired the
anti-Negro prejudices of the Caucasian-dominated larger society.
This acquisition process is presumed to be a gradual one resulting
from the child's repeated contacts with such prejudice through the 
preschool years. It would seem to follow from the normative theory 
that whenever children are acquiring any general concept/attitude 
“from the culture”, it would be expected that: (1) progressively 
older groups of children will show progressively more evidence of 
the concept/attitude; and (2) brighter children will show more 
evidence of the concept/attitude than will duller children. Such, 
in fact, were the findings of the present study for the sex-role
scores which reflected the child’s knowledge of sex-appropriate
behaviors. Regarding racial attitude scores, however, neither of these
conditions was met: racial attitude scores were not correlated
either with chronological age or with PPVT-IQ scores. Thus, high
racial attitude scores were equally evident among brighter and
duller children, and older and younger children. These findings are 
inconsistent with the requirements of the general normative theory. 

There remain at least two other plausible theories concerning 
the origin of the reliable individual differences in attitude scores.
It is possible that the children's attitude scores are related to 
individual differences in familial variables, e.g., attitudinal and/or 
cognitive characteristics of parents. A second possibility is that 
the children’s attitudes toward light and dark persons are influenced 
at least in part by individual differences in early learning ex- 
periences with light and darkness (“fear of the dark") which may 
be essentially independent of familial influences. In any event, it is 
clear that additional research in the areas of familial influence and/or 
early experiences will be needed to clarify the origins of the at- 
titudes being assessed by PRAM II. 

The preceding discussion raises the question as to whether the 
attitude trait measured by PRAM II should be designated as racial 
attitude, as opposed to "skin-color" attitude. This question is not 
easily answered. Part of the difficulty is the lack of systematic 
knowledge as to what “race” means to preschool children. Some 
evidence indicates that, at this age, skin-color is the most salient 
feature associated with race; and it has been demonstrated that 
preschoolers show little hesitancy in identifying persons as “white,” 
"Negro," “colored,” etc. when skin-color is the only basis for 
discrimination (Williams and Roberson, 1967). Hence, attitude 
toward skin-color and attitude toward “race” may be virtually 
synonymous at this age level. While the attitudes assessed may or
may not be “racial” in their origins, it seems clear that they are 
“racial” in their implications, and it is this latter usage which seems 
to warrant the designation of the PRAM II scores as measures of 
racial attitude. 

It would appear that the PRAM II procedure represents a sub- 
stantial advance in attitude assessment procedures for preschool 
children and should facilitate the study of many interesting and 
important questions dealing with the origins, development, and
modifiability of racial attitudes in young children. Regarding the 
latter, three studies employing PRAM II have already been con-
ducted (Graves, 1973; Shanahan, 1972; Yancey, 1972), and the
availability of alternate short forms of the PRAM II procedure 
should be of benefit to other experimenters who wish to conduct 
attitude change studies employing a pre-post design. The general 
evaluation dimension rationale on which PRAM II is based can
also be used to assess other types of attitudes in young children. 
For example, the rationale has been employed in a number of 
studies dealing with children’s attitudes toward the colors black 
and white (Renninger and Williams, 1966; Williams and Roberson, 
1967; Williams and Edwards, 1969; Skinto, 1969; Williams and
Rousseau, 1971; Figura, 1971; Vocke, 1971; Gordon and Williams, 
1973). In still another research area, the evaluative dimension 
rationale is currently being employed by the authors in pilot studies 
aimed at assessing attitudes of preschool children toward male and 
female persons. 


STRATEGIC USE OF RANDOM SUBSAMPLE
REPLICATION AND A COEFFICIENT OF
FACTOR REPLICABILITY

The problem of demonstrating replicability of factor structure
across random samples is addressed. Procedures are outlined which 
combine the use of random subsample replication strategies with 
the correlations between factor score estimates across replicate 
pairs to generate a coefficient of replicability and confidence in- 
tervals associated with the coefficient. Data from the national 
norming sample of the Self Observation Scales are used in the 
illustrative example. 


The importance of demonstrating the replicability of factor
structures has been the subject of discussion by Royce (1966) and 
Nesselroade and Baltes (1970). Nesselroade and Baltes argue: 
“It should be emphasized that, in general, comparative factor 
analysis remains in the dilemma of being at best a descriptive 
technique until more systematic information on factor similarity 
coefficients is available.” As early as 1947, Thurstone (1947) sug- 
gested that generalizability of factors is a major objective of factor 
analytic studies. Cattell (1966) stated: “The interpretation of 
factors, i.e., the inferring of their natures as scientific determiners, 
is closely tied to the problems of pattern matching and identifica- 
tion. (For interpreting a factor that has appeared in only one study 
would not be profitable as a rule.)” While in agreement with Cattell 
on this issue, many factor analysts find themselves in the position
of interpreting such factors. 

There is consensus on the desirability and importance of demon- 
strating factor replicability and invariance, and there are several 
indices of quantitative invariance available (Burt, 1948; Cattell, 
1944; Kaiser, 1960; Tucker, 1951). This consensus has led to the 
development of several sophisticated rotational techniques designed 
to produce a high degree of similarity between factor loading pat- 
terns from different sets of data (Browne and Kristoff, 1969; Cattell 
and Cattell, 1955; Eyferth and Sixtl, 1965; Fischer and Roppert, 
1964; Kristoff, 1964; Meredith, 1964; Taylor, 1967; Tucker, 1951). 
Nesselroade and Baltes (1970) have pointed out two problems 
characteristic of the practice of rotating factor structures to maxi- 
mum similarity: (a) Such procedures abandon the original rota- 
tion criteria, and (b) produce similarity coefficients with random 
data which approximate the magnitude of similarity coefficients 
found in studies using observed data. A third problem of these 
procedures and of several indices of quantitative invariance that 
have been proposed (Burt, 1948; Cattell, 1944; Kaiser, 1960; 
Tucker, 1951) is that adequate distributional data are not available. 
A final problem of procedures which deal with the factor pattern 
or factor structure matrices will be considered later in this study. 
Pinneau and Newhouse (1964) reviewed the commonly available 
indices of factor invariance and offered the coefficient of invariance 
as an alternative index of the extent to which two factors approach 
congruence. The coefficient of invariance operates on the factor 
scores rather than the factor pattern or factor structure matrices. 
It is the correlation between the estimated factor scores of subjects 
on two independently derived factors thought to be matched. This 
correlation is a useful measure of invariance of factor structures 
obtained from the analysis of a common set of variables on dif- 
ferent sets of subjects. Throughout this paper the term “replica- 
bility” will be used to refer to the reappearance of the same factors 
across random samples, while the term “invariance” will refer to 
the reappearance of the same factors when subjects have been sys- 
tematically selected. 

A promising element of research methodology relevant to demon- 
strating replicability and invariance of factors is found in the 
random subsample replication strategies. An excellent discussion 
of these strategies is provided by Finifter (1972). Finifter suggests 
the following definition for the procedures, which he labels “random 
subsample replication (RSSR)”:

A total body of data is constituted from, or subsequent to its
collection, divided into two or more independent random sub- 
samples. Each subsample is a replicate of a particular sampling 
design. A distribution of outcomes for a parameter being estimated 
is generated by applying a common analysis procedure to each 
subsample. The differences observed among the subsample results
are then analyzed to obtain an improved (namely, less variable or
less biased) estimate of the parameter, as well as a confidence assess-
ment for that estimate.

The RSSR strategies are particularly well suited to situations 
in which “statistical tests” have not been developed or are based
on shaky theoretical formulations. Finifter argues that, even in
cases where adequate theory for a “statistical test” exists, RSSR
strategies provide a closer approximation to the ultimate goal of 
actual replication than do derived distribution theories (Finifter, 
1972). Furthermore, when the issue transcends statistical signifi- 
cance and becomes one of practical utility, the family of RSSR 
strategies takes on increased attractiveness. The problem of demon- 
strating factor invariance, crucial to establishing the validity of 
generalizations from factor analytically derived scales, is well suited 
to the application of RSSR methodology. 

Horst (1966) has criticized RSSR strategy in multivariate re- 
search. Horst states: 

We know that, other things being equal, the more cases we have, 
the more stable and reliable our results will be. Therefore, for the 
purposes of both application and generalization, our procedures 
must be developed on the largest sample available. If we develop 
a procedure and then cross-validate it, we have ipso facto not de- 
veloped the best procedure possible from the available data. 

This is, of course, not a criticism of RSSR strategies themselves 
but is a criticism of their improper use. When dealing with con-
structs derived through factor analytic methods, failure to demon-
strate factor replicability and invariance leaves doubts about the
validity of such inferences as might be made from the data. A
peculiarity of factor analysis is that an increase in sample size does
not necessarily lead to an increase in the replicability of the factor
structure. Since every rotation of a factor matrix is an arbitrary
transformation, the factor structure derived from 10,000 cases in-
volving random responses or a structure derived from 100 such
observations is equally nonreplicable in a second random sample.
Further, the work of Nesselroade and Baltes (1970) and Nessel-
roade, Baltes, and Labouvie (1971) demonstrates that factor struc-
tures derived from random data can be rotated into positions which
yield very high coefficients of similarity across several factors. 
Needed is a procedure which (a) provides an estimate of the agree- 
ment among factor structures (invariance) without altering the 
original rotational criteria, (b) provides an estimate of the con-
fidence limits around this estimate, (c) includes all sources of
variability in its estimate of invariance,² and (d) bases the final
estimate of the factor structure on all of the available data. 

The approach implemented by the authors to meet these require- 
ments combines the use of random subsample replication strategies 
with factor analytic procedures. Correlations between the factor 
score estimates on matched factors for each replicate pair are used 
to generate a coefficient of replicability which provides both an 
estimate of factor replicability and confidence limits associated 
with the estimate. 
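
As a minimal sketch of the subsampling step, assuming the cases can be referred to by integer IDs, the following Python fragment divides a pool of cases into equal random replicates and lists the replicate pairs on which the coefficient is later computed. The function name `split_into_replicates`, the seed, and the integer IDs are illustrative assumptions, not part of the published procedure; the sample size (6,300) and the number of replicates (four) are those reported in the Method section.

```python
import itertools
import random


def split_into_replicates(case_ids, n_replicates, seed=0):
    """Shuffle case IDs and deal them into equal, non-overlapping subsamples."""
    ids = list(case_ids)
    if len(ids) % n_replicates != 0:
        raise ValueError("sample size must be divisible by the number of replicates")
    # The seed is an assumption made only so that the sketch is reproducible.
    random.Random(seed).shuffle(ids)
    size = len(ids) // n_replicates
    return [ids[i * size:(i + 1) * size] for i in range(n_replicates)]


# 6,300 cases divided into four replicates of 1,575 cases each.
replicates = split_into_replicates(range(6300), n_replicates=4)

# Each unordered pair of replicates later yields one correlation per factor.
replicate_pairs = list(itertools.combinations(range(1, 5), 2))
```

Four replicates give six replicate pairs (1-2, 1-3, 1-4, 2-3, 2-4, 3-4), which is why six correlations per factor are available for the confidence estimate.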


Method 


The data used in this study were obtained from the national
norming and validation of the Primary Level of the Self Observa-
tion Scales (SOS)³ (Stenner and Katzenmeyer, 1973). The sub-
jects of the study included first, second, and third graders who re-
sponded to the 50 items of the SOS Primary Form A during April
and May 1973. 

A sample of 6,300 cases was divided into four random subsamples 
(replicates) of 1,575 cases each. The steps involved in the separate 
factor analyses of each random subsample are described in Steps 1
through 5. Steps 6-9 describe the procedures involved in computing
the coefficients of replicability.

1. A matrix of phi coefficients was computed. When a missing 
datum was encountered, the mean value for that variable was 
inserted. The percentage of missing data was less than 4%. 

2. Squared multiple correlations were entered as initial com- 
munality estimates. Iteration for communalities proceeded until the 
maximum absolute deviation between iterations dropped below .001. 


² Similarity coefficients which deal with the factor structure or factor pattern
matrices do not include variance due to the indeterminacy associated with
estimating factor scores when the common factor model is employed.

³ The Primary Level of the Self Observation Scales (SOS) is a direct
self-report, group-administered instrument comprised of 45 items (Forms A
and B). The SOS (primary level) measures five dimensions of children’s
affective behavior: (1) Self Acceptance, (2) Social Maturity, (3) School
Affiliation, (4) Self Security, and (5) Achievement Motivation. The first four
scales were factor analytically derived. The fifth was developed using dis-
criminant analysis.

3. A rotation to the varimax criterion was performed.

4. The orthogonal varimax solution was rotated to maximum
oblique simple structure, using the Maxplane criterion (hyperplane 
width .15). 

5. The matrix of loadings of the variables on the factors ($V_{fe}$)
was computed using $V_{fe} = R_v^{-1} V_{fs}$, where $R_v^{-1}$ is the inverse of
the matrix of correlations among the variables and $V_{fs}$ is the
oblique factor structure (matrix of correlations of factors and
variables).

6. Scores on each variable (question) for the total group (6,300 
cases) were converted to z scores and factor score estimates (least 
squares regression estimates) were computed for each subject, on 
each factor, using the four $V_{fe}$’s. Since four factors were identified
in each of the rotations, this procedure resulted in 16 factor score 
estimates for each subject: four factor score estimates on each of 
four factors. 

7. Correlations between the estimated and true factor scores 
were computed (multiple correlation of the estimated scores with 
the 45 variables of the data matrix, which is also the standard 
deviation of the estimated factor scores). 

8. Correlation coefficients between factor score estimates from 
each replicate pair (six pairs) were computed. This procedure pro- 
duced six estimates of the coefficient of replicability for each 
factor. 

9. Coefficients of replicability and confidence intervals associated 
with these coefficients were obtained in the following manner: 
Fisher’s r to z transformation was performed with each of the six
coefficients of invariance obtained for each factor. The means and
standard deviations of Fisher z values were obtained and confidence
intervals computed (p < .05, p < .01). The r equivalents of the
mean Fisher z value and of the 95 and 99% confidence limit z
values were computed.
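
Steps 8 and 9 can be sketched in Python as follows. The six input values are the Social Maturity correlations reported in Table 3. The paper does not state whether the standard deviation was computed with n or n - 1, or which critical values were used, so the sample standard deviation and the normal critical values below are assumptions, and the resulting limits need not match Table 4 exactly. The function name is hypothetical.

```python
import math
import statistics


def coefficient_of_replicability(pair_correlations, confidence=0.95):
    """Return (r_c, lower, upper) from replicate-pair correlations via Fisher z."""
    # Step 9: Fisher r-to-z transformation of each replicate-pair correlation.
    z_values = [math.atanh(r) for r in pair_correlations]
    mean_z = statistics.mean(z_values)
    # Assumption: sample standard deviation (n - 1 in the denominator).
    sd_z = statistics.stdev(z_values)
    # Assumption: two-sided normal critical value (about 1.96 for 95%).
    crit = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    # Transform the mean and the limits back to the r metric.
    lower = math.tanh(mean_z - crit * sd_z)
    upper = math.tanh(mean_z + crit * sd_z)
    return math.tanh(mean_z), lower, upper


# Social Maturity correlations for pairs 1-2, 1-3, 1-4, 2-3, 2-4, 3-4 (Table 3).
social_maturity = [0.99, 0.99, 0.97, 0.99, 0.96, 0.95]
r_c, lower_95, upper_95 = coefficient_of_replicability(social_maturity)
r_c_99, lower_99, upper_99 = coefficient_of_replicability(social_maturity, 0.99)
```

With these inputs the back-transformed mean rounds to .98, the coefficient reported for Social Maturity in Table 4. Averaging in the z metric rather than the r metric keeps the limits inside the admissible range of a correlation.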


Results 


Table 1 presents the eigenvalues (squared multiple R’s in the
diagonal) of the correlation matrices derived from the four random 
subsamples. 

The eigenvalues are consistent across replications; the choice of 
four factors seems appropriate whether by Guttman’s middle bound, 
the Scree test, or interpretability of factors. 

Table 2 presents the percentage of variables in hyperplanes of 
varying widths after completion of the Maxplane rotation. 

TABLE 1
First Five Eigenvalues for the Four Replication Samples

                     Eigenvalue
Subsample     1      2      3      4      5
I           4.62   2.74   1.52   1.03   0.49
II          4.36   2.88   1.57   1.07   0.49
III         4.42   2.93   1.50   1.06   0.46
IV          4.54   2.64   1.62   1.04   0.44


Cattell (1966) suggests as a “rule of thumb” that in most do-
mains from 55 to 85% of the variables in the hyperplane is in-
dicative of adequate simple structure.⁴

Table 3 presents the correlations between factor score estimates 
on matched factors for each replicate pair. The 1-2 line presents the 
correlations between the factor score estimates obtained from anal- 
ysis of sample (replicate) one and the factor score estimates ob- 
tained from replicate two. 

Table 4 presents the coefficients of replicability for each of the 
factors, together with the upper and lower limits associated with 
the 95 and 99% confidence intervals. It is worth noting that the
standard error of the coefficient of replicability includes error 
variance from at least two sources: (a) sampling error resulting 
from the selection of the random subsamples, and (b) deviations 
of the factor score estimates from the true factor scores. Guttman 


⁴ This criterion is, however, a function of the number of factors extracted.
Rotation of these same data for 14 factors yielded 80% of the variables in
the hyperplane. Another study by the authors (in press) revealed hyperplane
counts of 90% in the .15 hyperplane when rotating for seven factors with
random data.


TABLE 2
Percent Variables in Hyperplane for Varying Widths in Each of the Four Replication
Samples

                  Hyperplane Width
Subsample     .05    .10    .15    .20
I            33.3   51.7   68.3   72.8
II           36.1   54.4   70.0   72.8
III          28.3   52.2   69.4   72.8
IV           33.9   54.4   68.9   72.2

TABLE 3
Correlations between Factor Score Estimates on Each Replicate Pair

Replicate       Self         Social        Self        School
Pair         Acceptance     Maturity     Security    Affiliation
1-2             .97           .99           .96          .99
1-3             .94           .99           .98          .99
1-4             .97           .97           .99          .99
2-3             .98           .99           .94          .99
2-4             .96           .96           .97          .99
3-4             .91           .95           .97          .98


(1955) has observed that, in common factor analysis, the regression 
estimates may depart considerably from the true scores and has 
suggested methods for computing the correlation between the esti- 
mates and the true factor scores. Guttman has shown that the 
theoretical minimum correlation between the alternative factor 
score estimates drops rapidly as the correlation between the true
factor scores and the regression estimates becomes lower.⁵ Knowing
the theoretical maximum error is, however, not very useful to the 
researcher, except to generate temperance in the interpretation of 
his results. The coefficient of replicability provides an estimate of 
the maximum loss due to the difference between the true factor 
scores and the regression estimates of these scores. The standard 
coefficient of invariance, reflecting both the sampling error and the 
error due to imperfect regression estimates, allows us to set an upper 
bound on the amount of error that is associated with the difference 
between the true factor scores and the factor score estimates. It is
suggested that $1 - r_c^2$ (where $r_c$ is the coefficient of replicability


⁵ For example, if the correlation between true scores and regression esti-
mates drops to .90, the maximally different set of scores would have a correla-
tion of only .62 with the original estimates.


TABLE 4
Coefficients of Replicability and Their Associated Confidence Limits

                                 Lower Limits         Upper Limits
                      $r_c$    p<.01     p<.05      p<.05    p<.01
Self Acceptance        .96      .84       .89        .99      .99
Social Maturity        .98  [illegible]   .92        .99      .99
Self Security          .97      .86       .90        .99      .99
School Affiliation     .99      .96       .97        .99      .99

as defined above) is the total error variance across factor pairs and
constitutes the upper limit of the error due to regression estimates. 

In the example of this paper the correlations between the re-
gression estimates and the true factor scores are .90 for Self Ac-
ceptance, .91 for Social Maturity, .88 for Self Security, and .90 for
School Affiliation. Reference to the table of Guttman (1955) reveals
that the maximally different sets of factor scores will have correla-
tions in the low sixties with these estimates. Such correlations (low
60’s) would be unacceptable and would have serious effects on re-
liability and usability of the factor scores. The coefficient of rep-
licability with its associated confidence limits reveals that the total
error, both of regression and of sampling, does not exceed
$100(1 - r_c^2)$% of the variance. In the case of the factor labeled “Self
Acceptance,” the best estimate is that $(1 - .96^2)$ or 7.55% of the
total variability is accounted for by the difference between the
samples and differences between true factor scores and regression
estimates. It can be said with 99% confidence that the between
factor variance from all sources does not exceed $(1 - .84^2)$ or
29.1% of the total variability. It may be observed (99% confidence)
that for the factor labeled “School Affiliation” the between factor
variance will not exceed $(1 - .96^2)$ or 7.1% of the total variance.
This finding, when compared with the correlation of .90 between 
the true factor scores and the regression estimates, suggests that, 
for the normal range of scores, the loss due to using regression esti- 
mates of the true score may not be as great as has previously been 
believed. While not questioning Guttman’s derivation, it may be 
suggested that the deviations of regression estimates from true 
factor scores may be systematic rather than random. A procedure 
for partitioning the unexplained variance between matched factors 
will be presented in a later article. 
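
The two quantities compared in this paragraph can be computed directly. The sketch below assumes that the entries in Guttman's (1955) table follow the relation 2ρ² - 1 for the minimum correlation between alternative sets of factor scores, a relation that reproduces the .90 to .62 example given in footnote 5; the function names are hypothetical. The percentages quoted in the text (for example, 7.55%) were presumably computed from unrounded coefficients, so the two-decimal values of Table 4 give slightly different figures.

```python
def guttman_minimum_correlation(rho):
    """Minimum correlation between alternative factor score sets.

    Assumes the relation 2 * rho**2 - 1, where rho is the correlation
    between the regression estimates and the true factor scores.
    """
    return 2 * rho ** 2 - 1


def error_variance_bound(r_c):
    """Upper bound on the proportion of error variance: 1 - r_c squared."""
    return 1 - r_c ** 2


# Correlations between regression estimates and true scores, from the text.
determinacy = {
    "Self Acceptance": 0.90,
    "Social Maturity": 0.91,
    "Self Security": 0.88,
    "School Affiliation": 0.90,
}
minimum_correlations = {
    name: guttman_minimum_correlation(rho) for name, rho in determinacy.items()
}

# Bound implied by the rounded Self Acceptance coefficient of .96.
self_acceptance_bound = error_variance_bound(0.96)
```

The contrast is the point of the paragraph: the worst-case correlations implied by indeterminacy alone are far lower than the replicate-pair correlations actually observed.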

A clear conclusion of this study is that the Primary Level (Form 
A) of the Self Observation Scales has a factor structure that can 
be replicated on independent subsamples of children for which the 
instrument was designed, and that the level of stability of factors 
derived from various random subsamples is high. Derived from a
sample comprised of children of different races, ages, socio-economic 
advantage, and geographic regions, the results of this analysis 
increase the confidence with which the obtained factor structure 
can be utilized. 

The general approach of combining RSSR strategies with factor 
analytic procedures and computing a coefficient of replicability 
yields a more stringent and informative test of invariance than do
the approaches which rotate two or more structures to maximum
congruence. In addition, the original rotational criteria are not 
violated in the process of trying to manufacture congruence (order 
can often be imposed on a chaotic system, but if it is forced, the 
resultant similarity between the obtained “order” and nature may 
be purely coincidental). 

In combination, RSSR and the coefficient of replicability can 
contribute significantly to the confidence with which factors are 
identified and reported. One result of applying this strategy across 
random subsamples may be a substantial reduction in the number 
of reported factor analytic studies. 


Summary 


1. Demonstration of factor stability (invariance) is prerequisite 
to the legitimate use of such factors in scientific inquiry. 

2. Methods for obtaining and estimating factor invariance through
post hoc rotational techniques aimed at producing a high degree of
configurational similarity between factor loading patterns from 
various sets of data (e.g., Browne and Kristoff, 1969) present several 
problems. 

a. The original rotation criteria are abandoned. 

b. Similarity coefficients derived from the use of these tech- 
niques with random data capitalize on chance elements and often 
produce factor similarity coefficients in the same range as have been 
reported with real data. 

c. Many rotational schemes which operate on factor matrices
ignore the error introduced in the common factor model by differ- 
ences between the true factor scores and the regression estimates 
of the factor scores. 

3. The coefficient of replicability reflects both error due to sam- 
pling and error due to the indeterminacy of the common factor 
model. 

4. Random subsample replication techniques provide a method 
for computing a series of coefficients of replicability across paired 
random subsamples.

5. The coefficient of replicability provides a method for estimating 
stability across any number of replications, stating confidence limits
concerning the replicability of each factor.

6. The proportion of error introduced by the difference between 
regression estimates of the factor scores and the true factor scores 
will not be more than 1 minus the square of the coefficient of repli-
cability. Confidence limits for the true value of $r_c$ are easily
obtainable.

7. The correlation between true factor scores and regression 
estimates of these scores does not set an upper limit on the coeffi- 
cient of factor replicability. 

8. The factor structure of the Primary Self Observation Scales 
is replicable over random subsamples.

9. RSSR strategies are appropriate to the demonstration of 
factor invariance in most factor analytic studies. An adequate 
sample of data for factor analysis is usually adequate for one of 
the RSSR strategies. 

10. The routine use of RSSR strategies and reporting of both
standard coefficients of replicability and invariance and their con- 
fidence limits would greatly enhance the ability to make inferences 
from such studies and would probably improve the average quality 
of factor analytic studies reported in the literature. 


DOMAIN VALIDITY AND GENERALIZABILITY

An alternative derivation of Tryon’s basic formula for the co-
efficient of domain validity or the coefficient of generalizability
developed by Cronbach, Rajaratnam, and Gleser is provided. This
derivation, which also yields the generalized Kuder-Richardson coeffi-
cient, requires a relatively minimal number of assumptions com-
pared with that in previously proposed approaches.


This note provides an alternative derivation of the basic formula
for Tryon's (1957) coefficient of domain validity or the coefficient 
of generalizability offered by Cronbach, Rajaratnam, and Gleser 
(1963). It appears that the writers’ treatment affords a more nearly 
assumption-free development for these coefficients (which are also 
the generalized Kuder-Richardson coefficient) than that which has 
been proposed previously. 

The most fundamental notion in statistics is that of making 
inductive inferences about the characteristics of a population of 
individuals on the basis of observations on a sample of individuals 
from the population. Similarly, but not usually made so explicitly, 
the fundamental problem of psychometrics is to study the nature
of a domain or universe of variables on the basis of observations
on a sample (or selection) of variables from the domain or universe.
(Tryon’s “domain” is Guttman’s or Cronbach’s “universe.”) Thus,
if there exist n individuals and p variables, statistical inference is
concerned with what happens as n becomes very large, whereas
psychometric inference is concerned with what happens as p be- 
comes very large. It is in the context of psychometric inferences 
that the confidence one may have in the observed score on a test 
is to be considered. 

The observed score, $\hat z$, on a test may be taken as the sum of the
p observed item scores:

$$\hat z = \sum_{j=1}^{p} z_j, \tag{1}$$

where $z_j$ is the score on the jth observed item. Similarly, that with
which the writers are most fundamentally concerned, the domain
score, $z$, is

$$z = \sum_{j=1}^{p} z_j + \sum_{s=1}^{q} z_s, \tag{2}$$

the sum of the $z_j$, the scores on the p observed items, plus the sum
of the $z_s$, the scores on the q remaining hypothetical (unobserved)
items in the domain.

To measure confidence in the observed scores as estimates of the
domain scores, the correlation R between $\hat z$ and $z$ is obtained; clearly,
if this correlation is high, the observed score is essentially the desired
domain score, whereas if the correlation is low, any inferences about
$z$ based on $\hat z$ are to be viewed with suspicion.

Elementary calculation yields the squared correlation $R^2$ between
the observed score (1) and the domain score (2):

$$R^2 = \frac{[p\bar V_j + p(p-1)\bar C_{jk} + pq\bar C_{js}]^2}{[p\bar V_j + p(p-1)\bar C_{jk}]\,[p\bar V_j + p(p-1)\bar C_{jk} + 2pq\bar C_{js} + q\bar V_s + q(q-1)\bar C_{st}]}, \tag{3}$$

where $\bar V_j$ is the mean variance of the observed items, $\bar C_{jk}$ is the mean
covariance between the observed items, $\bar V_s$ is the mean variance of
the unobserved items, $\bar C_{st}$ is the mean covariance between the unob-
served items, and $\bar C_{js}$ is the mean (cross) covariance between the
observed items and the unobserved items. To clarify these symbols,
typical elements of the variance-covariance matrix of all p + q
items in the domain are illustrated in Figure 1.
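
The elementary calculation can be sketched as follows, writing $\hat z$ for the observed score, $z$ for the domain score, and using overbars for the mean variances and covariances just defined. This is a reconstruction from those definitions rather than the writers' own working.

```latex
\begin{align*}
\operatorname{Var}(\hat z) &= p\bar V_j + p(p-1)\bar C_{jk},\\
\operatorname{Cov}(\hat z, z) &= \operatorname{Var}(\hat z) + pq\,\bar C_{js},\\
\operatorname{Var}(z) &= \operatorname{Var}(\hat z) + 2pq\,\bar C_{js}
                        + q\bar V_s + q(q-1)\bar C_{st},\\
R^2 &= \frac{\operatorname{Cov}(\hat z, z)^2}
            {\operatorname{Var}(\hat z)\,\operatorname{Var}(z)}.
\end{align*}
```

Each line counts the entries of one block of the item variance-covariance matrix: p variances and p(p - 1) covariances among the observed items, pq cross-covariances (counted twice in the variance of the domain score), and q variances and q(q - 1) covariances among the unobserved items.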

To simplify the algebra, q has been kept finite thus far; consistent
with the notion that the number of potential items in a domain is
indefinitely large, one may take the limit of (3) as $q \to \infty$, and thus
obtain:

$$R^2 = \frac{p^2 \bar C_{js}^2}{[p\bar V_j + p(p-1)\bar C_{jk}]\,\bar C_{st}}. \tag{4}$$
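
The limit can be checked by dividing the numerator and denominator of (3) by $q^2$. The abbreviation $A$ below is introduced only for this sketch.

```latex
\begin{align*}
A &= p\bar V_j + p(p-1)\bar C_{jk},\\
R^2 &= \frac{\left[\dfrac{A}{q} + p\,\bar C_{js}\right]^2}
            {A\left[\dfrac{A}{q^2} + \dfrac{2p\,\bar C_{js}}{q}
              + \dfrac{\bar V_s}{q}
              + \left(1 - \dfrac{1}{q}\right)\bar C_{st}\right]}
    \;\xrightarrow[\;q \to \infty\;]{}\;
    \frac{p^2\,\bar C_{js}^2}{A\,\bar C_{st}}.
\end{align*}
```

Only the cross-covariance term survives in the numerator and only the mean covariance among the unobserved items survives in the second factor of the denominator, which is why the mean variance of the unobserved items does not appear in the limiting formula.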

[Figure 1: matrix partitioned into observed items and unobserved items.]

Figure 1. Typical elements in the symmetric variance-covariance matrix
of the items in a domain. The subscripts j and k refer to observed items; the
subscripts s and t to hypothetical unobserved items.


It is not possible to evaluate (4) because of the unknown terms
$\bar C_{js}$ and $\bar C_{st}$. Thus the crucial point of the development is reached;
certain “assumption(s)” about these unobserved quantities must
be made. For this purpose, one may take

$$\bar C_{js}^2 = \bar C_{jk}\,\bar C_{st}. \tag{5}$$

When (5) is substituted in (4), it is found that

$$R^2 = \frac{p^2\,\bar C_{jk}}{p\bar V_j + p(p-1)\bar C_{jk}}, \tag{6}$$

the basic formula in terms of observables for the squared coefficient
of domain validity. This formula may look more familiar if one
remembers that its denominator is simply $V_{\hat z}$, the variance of $\hat z$, and
that $\bar C_{jk} = (V_{\hat z} - p\bar V_j)/(p(p-1))$. Hence (6) becomes

$$R^2 = \text{alpha} = \frac{p}{p-1}\left(1 - \frac{p\bar V_j}{V_{\hat z}}\right), \tag{7}$$

Cronbach’s (1951) classic coefficient alpha, the generalized Kuder-
Richardson formula 20. The square root of (7), R itself, is then the
correlation between $\hat z$ and $z$, Tryon’s coefficient of domain validity.
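
The equivalence of (6) and (7) can be checked numerically. The following sketch evaluates both forms on a small item covariance matrix; the matrix values and the function names are hypothetical and serve only as an illustration.

```python
import math


def domain_validity_squared(cov):
    """Equation (6): p^2 times mean covariance over variance of the total."""
    p = len(cov)
    mean_var = sum(cov[j][j] for j in range(p)) / p
    mean_cov = sum(
        cov[j][k] for j in range(p) for k in range(p) if j != k
    ) / (p * (p - 1))
    return p ** 2 * mean_cov / (p * mean_var + p * (p - 1) * mean_cov)


def coefficient_alpha(cov):
    """Equation (7): p/(p-1) times (1 - sum of item variances / total variance)."""
    p = len(cov)
    total_var = sum(sum(row) for row in cov)  # variance of the summed score
    sum_item_var = sum(cov[j][j] for j in range(p))
    return p / (p - 1) * (1 - sum_item_var / total_var)


# Hypothetical covariance matrix for three observed items.
example_cov = [
    [1.0, 0.5, 0.4],
    [0.5, 1.0, 0.3],
    [0.4, 0.3, 1.0],
]
r_squared = domain_validity_squared(example_cov)
alpha = coefficient_alpha(example_cov)
domain_validity = math.sqrt(alpha)  # R, the coefficient of domain validity
```

For this matrix both forms give 2/3, so the coefficient of domain validity is about .82. The sum of the item variances in the second function equals p times the mean item variance, which is the quantity appearing in (7).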

The writers have just engaged in one of the favorite indoor 
sports of psychometricians: producing a new development for the 
Kuder-Richardson formula. However, to qualify for serious consid-
eration, new derivations must surely additionally obey the rule of
making less restrictive assumptions than previous treatments; as 
is well known, this classic formula has a history of being developed 
with unnecessarily restrictive assumptions. Thus, one should look 
carefully at the crucial assertion of equation (5). 

First, let the assumptions that have not been made be explicitly 
stated. Nothing has been said about the individual means, variances, 
and covariances of the items, or about the internal structure of the 
items. Additionally, there has been no suggestion that the unob- 
served items are “parallel” to the observed items; none of the 
so-called “equivalence” assumptions is set forth. 

There is one key statement in (5) concerning the cross-covariance 
between the observed and the unobserved items: the mean cross- 
covariance is equal to the geometric mean of the two mean within- 
covariances. It is suggested that (5) is not really an assumption, but 
may be viewed as perhaps the simplest possible definition serving 
to link the unobserved items as belonging to the domain inducible 
from the observed items. 

In fact, it is suggested that (5) may be the only reasonable defini- 
tion in this context, for anything else appears to be unreasonable. 
If one considers taking the inequality $\bar C_{js}^2 < \bar C_{jk}\bar C_{st}$ as a definition,
it is clear that the association between observed and unobserved
items is generally lower than that within each item set separately,
an outcome which surely denies that the hypothetical unobserved
items belong (in general) to the domain implied by the observed
items.

On the other hand, if one took the inequality $\bar C_{js}^2 > \bar C_{jk}\bar C_{st}$, one
would be saying that the association between observed and unob-
served items is generally higher than that within each item set
separately. In any case, asserting that $\bar C_{js}^2 > \bar C_{jk}\bar C_{st}$ is essentially
impossible, for it is clear that such a statement would soon render
the domain item covariance matrix in Figure 1 indefinite. Indeed,
one can calculate the maximum possible $\bar C_{js}^2$ by letting $R^2 = 1$
in (4):

$$\max(\bar C_{js}^2) = \left(\frac{p-1}{p}\right)\bar C_{jk}\bar C_{st} + \left(\frac{1}{p}\right)\bar V_j\bar C_{st}, \tag{8}$$

a quantity which rapidly reduces to that of the writers’ basic
definition (5) as the number of observed items becomes large.
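
The rate at which the maximum approaches the value set by definition (5) can be made explicit by dividing the maximum by $\bar C_{jk}\bar C_{st}$. This is a sketch that assumes $\bar C_{jk} \neq 0$.

```latex
\begin{equation*}
\frac{\max(\bar C_{js}^2)}{\bar C_{jk}\,\bar C_{st}}
  = \frac{p-1}{p} + \frac{1}{p}\,\frac{\bar V_j}{\bar C_{jk}}
  = 1 + \frac{1}{p}\left(\frac{\bar V_j}{\bar C_{jk}} - 1\right)
  \;\longrightarrow\; 1
  \quad\text{as } p \to \infty .
\end{equation*}
```

The excess over definition (5) therefore shrinks in proportion to 1/p, with a multiplier that depends only on the ratio of the mean item variance to the mean inter-item covariance.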

The reader may have the impression that the above treatment 
constitutes a derivation of the formula for the generalized Kuder-
Richardson reliability coefficient. This impression is not true, for,
until this point, nothing has been said explicitly about reliability 
of measurement. The development has been entirely in the context 
of Tryon or Cronbach’s profoundly simple and profoundly com-
pelling concepts of domain validity or generalizability. Substantial
effort suggests that it is impossible to derive (6) or (7) as a re- 
liability coefficient without bringing to bear the ominous equivalence 
assumptions that have plagued the classic theory of reliability. 
Tryon (1957), for once bowing to tradition, first developed R? as a 
reliability coefficient and was forced to invoke the usual restrictive 
assumptions; he then proceeded to derive his domain validity 
coefficient, “. . . a statistic that is more meaningful than the reli-
ability coefficient . . . ,” from his earlier development of the re-
liability coefficient. In contrast, but following Tryon’s algebra
closely, the writers have gone directly to his “more meaningful”
coefficient without stopping along the way to pick up the excess 
baggage of the conventional assumptions of classic reliability 
theory. 


Responses to the Kuder Occupational Interest Survey by 3983
males from nine occupational groups provided the data to compare 
the accuracy of four scoring strategies: lambda coefficients, cur- 
rently used by the test publisher, chi-square weights and two 
applications of multiple discriminant analysis. An analysis of 
variance test of the percentage of individuals correctly identified 
by each technique indicated that no statistically significant differ- 
ences existed between the strategies. The study therefore provided 
empirical evidence supporting the continued use of the lambda 
weighting procedure for the scoring of the Kuder OIS.


Improving the discriminatory accuracy of interest surveys has
been a problem of concern to measurement theorists for many years. 
One approach taken to increase the effectiveness of existing instru- 
ments has been the development of new scoring techniques (e.g., 
Findley, 1956; Porter, 1967; Kuder, 1966). As a consequence, a 
number of promising scoring procedures have been suggested but 
efforts to identify the optimal method for a given instrument have 
been lacking. 

The purpose of the present investigation was to compare the 
effectiveness of four strategies for scoring the Kuder Occupational 
Interest Survey Form DD. More specifically, techniques which were
considered included: lambda coefficients, the procedure currently
used by the publisher; chi-square weights as developed by Porter
(1967); and multiple discriminant analysis using occupational
scores generated by (a) lambda coefficients, and (b) chi-square
weights. Loadman (1972) conducted a similar study which considered
a pattern analytic approach rather than discriminant analysis
with chi-square occupational scores for the scoring of the
Kuder OIS. His comparisons of the four scoring techniques may
have been misleading, however, since in his study some of the
procedures were cross-validated while others were not. In addition 
to correcting the problems of cross-validation, the present study 
also provided for a re-analysis. 


Format of the Kuder OIS 


The Kuder OIS attempts to identify occupational interests by 
analyzing responses to one hundred items, each consisting of a set 
of three activities. Each triad representing an item is presented 
to the individual in the format shown in Figure 1. From each group 
of activities the respondent is instructed to select the activity he 
prefers most and that which he likes least. The two responses per
item can be then summarized by one of the following patterns: 
1-5, 1-6, 2-4, 2-6, 3-4, or 3-5. Scoring the instrument is problematic 
since every response is correct if answered sincerely. 


Scoring Procedures 
Lambda Coefficients 


To solve the scoring problem for the OIS, the publishers, Science
Research Associates (SRA), currently use a lambda coefficient as a
measure of the similarity of interests between an individual and a
particular occupational group. The technique was developed by
Kuder (1966) based on the research of Clemans (1958), who had
suggested that the relationship between an item and a criterion
could be measured by the ratio of a point biserial to the maximum
point biserial correlation. In calculating the point biserial correlation
the dichotomous variable is the selection or nonselection of
the 600 possible response patterns (six for each of the 100 items),
while the continuous variable is the proportion of the criterion
group selecting each of the 600 response patterns. Thus the
correlation is computed between a vector of 600 1's and 0's
corresponding to the selected response patterns of the individual
and a vector of 600 proportions each associated with a separate
response pattern. The maximum point biserial correlation is
computed based on the selection of the highest response pattern
proportion for each item across all 100 items. The ratio of the two
correlations produces the lambda coefficient, which has an upper
limit of 1.00 and is unaffected by the homogeneity of the
particular criterion group.


Figure 1.

              Most     Least
Activity 1    (1) 0    (4) 0
Activity 2    (2) 0    (5) 0
Activity 3    (3) 0    (6) 0


For each occupation a set of lambda weights is calculated. The 
computational formula used in deriving the lambda weights for a 
particular item (i) and response patterns (j) may be presented as: 


$$\lambda_{ij} = \frac{P_{ij} - \bar{X}}{\sum_{i=1}^{100} \max_j \left(P_{ij} - \bar{X}\right)} \tag{1}$$


where $P_{ij}$ is the sum of the proportion of individuals in a particular
criterion group who select the activity most liked plus the proportion
of the individuals selecting the activity least liked which make up
response pattern (j) for item (i). $\bar{X}$ is the average value of the $P_{ij}$
across all 600 possible response patterns,


$$\bar{X} = \frac{\sum_i \sum_j P_{ij}}{600} = .667.$$


The denominator of equation (1) can be rewritten as $\sum_i \max_j P_{ij} - 100\bar{X}$.
That is, the largest $P_{ij}$ for each of the 100 items are summed
together and the total is reduced by 100(.667), or 66.7. It should be
noted, then, that the denominator of the equation remains constant
for the computation of all 600 lambda weights for a particular
criterion group but could vary across criterion groups. The computation
of the lambda weight is such that the sum of the 100 $\lambda_{ij}$'s
selected by an individual results in the lambda coefficient.

The similarity of an individual's interests with a particular occupational
group is estimated by the lambda coefficient. The occupational
group for which the individual's lambda coefficient is the
highest is designated as the most compatible group.
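
The lambda weighting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-item criterion-group proportions are hypothetical, the function names are invented here, and the weight formula is the one given in equation (1) as read from the surrounding text, not the publisher's actual scoring program.

```python
from itertools import product

# The six admissible (most, least) patterns for one triad; the same
# activity cannot be chosen as both most and least liked.
PATTERNS = [(m, l) for m, l in product(range(3), range(3)) if m != l]


def pattern_proportions(most, least):
    """P_ij for one item: proportion choosing activity m as 'most'
    plus proportion choosing activity l as 'least'."""
    return [most[m] + least[l] for m, l in PATTERNS]


def lambda_weights(items):
    """Return (weights, x_bar) for one criterion group.

    items: list of (most, least) pairs, each a triple of proportions.
    """
    P = [pattern_proportions(most, least) for most, least in items]
    n_patterns = sum(len(row) for row in P)
    x_bar = sum(sum(row) for row in P) / n_patterns
    denom = sum(max(row) - x_bar for row in P)
    weights = [[(p - x_bar) / denom for p in row] for row in P]
    return weights, x_bar


def lambda_coefficient(weights, chosen):
    """Sum the weights of the pattern an individual chose on each item."""
    return sum(row[j] for row, j in zip(weights, chosen))


# Hypothetical criterion-group proportions for a two-item instrument.
items = [
    ((0.6, 0.3, 0.1), (0.1, 0.2, 0.7)),
    ((0.2, 0.5, 0.3), (0.4, 0.1, 0.5)),
]
weights, x_bar = lambda_weights(items)

# An individual who always picks the group's most popular pattern.
best = [row.index(max(row)) for row in weights]
print(round(x_bar, 3), round(lambda_coefficient(weights, best), 3))
```

Because every "most" and every "least" proportion enters exactly two of the six patterns of an item, the mean of the pattern proportions is 4/6 regardless of the data, which is where the constant .667 comes from. An individual who selects the most popular pattern on every item reaches the upper limit of 1.00.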
Chi-Square Weights

Another approach for developing item response pattern weights 
for the Kuder OIS was suggested by Porter (1967). Rather than 
considering one occupational group at a time, as in the lambda 
procedure, the chi-square technique examines the responses made 
by several criterion groups. For each of the 100 items on the instru- 
ment, a contingency table consisting of a simultaneous breakdown 
of subjects by occupations and response patterns is constructed. The 
weights assigned per response pattern per occupation for each table 
are calculated using the following formula:


8-89) 


where i = 1, ..., I and j = 1, ..., J; I denotes the number of occupations
and J denotes the number of response patterns. $x_{ij}$ is the
number of responses made by the i-th group to the j-th response
pattern, $X_i$ is the total number of subjects in the i-th group, $Y_j$ is
the total number of individuals selecting the j-th response pattern
and $\sum X_i$ is the total number of subjects in the sample.


Thus for each of the occupations considered, a fractional weight
is calculated for each of the possible response patterns. An individual's
score for an occupation is the sum of the chi-square weights
associated with the individual’s responses to the 100 items. The 
higher the total score, the greater the similarity in interests between 
the individual and a criterion group. 


Multiple Discriminant Analysis 


Still another approach to the scoring problem, and one which has 
been used with considerable success in other classification problems, 
is the application of the linear discriminant function. In the present 
study, this technique was applied using two different sets of vari- 
ables. The first set of variables was the occupational scores obtained 
following the lambda weighting procedure and the second set of 
variables was the occupational scores developed from the chi-square 
weighting procedure. Following the identification of the best linear 
combinations of these variables, classification of an individual was 
obtained by the simple $d^2$ function, i.e., for each occupation the
sum across functions of the squared deviations of an individual's
composite score from the mean composite score. The respondents
were classified as belonging to the occupational group corresponding
to the smallest $d^2$ value.
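
The classification rule just described amounts to a nearest-centroid assignment in the space of discriminant functions. The sketch below illustrates the rule only; the group centroids and the respondent's composite scores are hypothetical, and the estimation of the discriminant functions themselves is not shown.

```python
def squared_distance(scores, centroid):
    """Sum across discriminant functions of squared deviations
    of an individual's composite scores from a group's mean scores."""
    return sum((s - c) ** 2 for s, c in zip(scores, centroid))


def classify(scores, centroids):
    """Return the group whose centroid gives the smallest squared distance."""
    return min(centroids, key=lambda g: squared_distance(scores, centroids[g]))


# Hypothetical mean composite scores on two discriminant functions.
centroids = {
    "forester": (1.2, 0.1),
    "social worker": (-0.8, 0.9),
    "auto mechanic": (0.3, -1.4),
}

respondent = (1.0, 0.2)  # hypothetical composite scores
print(classify(respondent, centroids))
```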


Sample 


To develop and test the effectiveness of the four scoring pro- 
cedures described, the item responses of 3893 males, from nine 
unequally sized occupational groups, to the Kuder OIS were 
analyzed. These responses were originally collected by Kuder
while developing the instrument and later obtained by Porter for
the development of the chi-square scoring procedure. These data
were also used by Loadman in his analysis. Although the data are
probably too old for use in developing scoring keys for present day 
use, they did serve as an adequate base for comparison of the four 
scoring procedures. 

The occupational groups selected for study included: pediatricians, 
veterinarians, physical therapists, x-ray technicians, optometrists, 
clinical psychologists, social workers, foresters, and auto mechanics. 
The first five groups were designated as set I and were considered 
as similar occupations, while the last four groups plus optometrists 
were considered as dissimilar and labeled set II. Since optometrists 
appeared in both sets of data, the two sets were not independent. 
Each occupational group was randomly divided into two halves, 
A and B; thus two independent groups of data were available for 
each set. To obtain an estimate of the “true” effectiveness of each
scoring procedure, a double cross-validation technique as suggested 
by Mosier (1951) was followed. 


Analysis 

To compare the results of the four Kuder OIS scoring procedures, 
an analysis of variance for mixed models was utilized. The de- 
pendent variable in the study was the average (across halves A 
and B) percentage of correctly identified individuals for an occupa- 
tion. The problem of nonindependence between sets was solved by 
using the cross-validated results of half A for optometrists in set I 
and the cross-validated results of half B in set II. The fixed independent
variables were: sets, similar or dissimilar occupations, S;
measures, lambda coefficients or chi-square weights, M; and discriminant
analysis or not, D. All three fixed independent variables
were completely crossed with each other. Occupation was treated
as a random independent variable which was nested within S, with
five levels per nest, but crossed with the two scoring procedures D
and M.

Although occupations were not actually selected at random from
a larger pool of occupations, it was felt that by using the Cornfield- 
Tukey bridge argument (Cornfield and Tukey, 1956) the results 
of this study could be generalized to similar occupations. Had occu- 
pations been treated as fixed, greater power would have resulted in 
the analysis since the individual test respondent would have been 
the unit of analysis rather than occupations. The results of such 
a test, however, would have had very little practical value.

The hypotheses tested included differences in the discriminatory
accuracy: between sets of occupational groups, between measures,
chi-square and lambda techniques, and between using multiple
discriminant analysis or not. Interaction effects between measures
and discriminant analysis as well as interaction effects with sets
were also tested. Each hypothesis was tested for statistical significance
at α = .05.


Results 


The average percentages of individuals in halves A and B correctly
classified into their actual occupational groups based on the cross-validated
scoring keys are presented in Table 1. The cross-validated
average percent correct classifications in the present study were
compared with Loadman’s (1972, p. 114) quasi-cross-validated
results. As was expected, the cross-validated results were in each
case smaller. For the lambda procedure the shrinkage due to cross-validation
was 5.63 in set I and 1.27 in set II. For the discriminant
analysis procedure based on lambda scores the shrinkage was 11.26
in set I and 14.70 in set II. Thus a great deal of bias had entered
Loadman’s study when he compared his non-cross-validated lambda
based results with his cross-validated chi-square based results.


TABLE 1

The Average Percentage of Individuals Correctly Classified Into Their Actual
Occupational Group from Half A and Half B for Set I and Set II

                               Discriminant Analysis     Non-Discriminant Analysis
                               Chi-Square    Lambda      Chi-Square    Lambda
                               Weights       Weights     Weights       Weights

Set I
    Optometrist                62.54         54.19       60.10         63.05
    X-Ray Technician           52.04         30.61       43.08         54.87
    Pediatrician               57.47         44.14       65.28         67.58
    Physical Therapist         42.91         42.19       38.06         49.85
    Veterinarian               78.56         71.98       88.40         60.11

Set II
    Psychologist               71.40
    Auto Mechanic              83.34
    Forester                   76.30         68.70       43.56         78.04
    Optometrist                78.00         51.50       38.00         75.00
    Social Worker              53.21         51.63       53.88         70.06


TABLE 2

ANOVA Table for Mixed Models, Analyzing the Data from Table 1

                 Degrees of
Sources          Freedom       Mean Squares      F        p

S                1             1449.86           2.51     .15
O:S              8              578.07
D                1              126.59           1.47     .26
M                1               48.44            .48     .52
S×D              1               28.06            .33     .59
S×M              1               85.73            .85     .39
D×M              1              625.84           3.55     .10
S×D×M            1               76.95            .44     .53
O×D:S            8               85.97
O×M:S            8              100.90
O×D×M:S          8              176.51


The ANOVA table associated with the data in Table 1 is pre- 
sented in Table 2. The unweighted average percentage of individuals 
correctly classified in sets I and II were 56.37 and 68.41 respectively. 
Although the difference in percentage correct identifications between 
sets was in the predicted direction, the null hypothesis of no differ- 
ence between sets was not rejected, p < .15. A comparison of the 
average correct classification rates (across the four scoring tech- 
niques) among optometrists in set I and set II provided a further 
test of the hypothesis. Among the homogeneous occupations, optome- 
trists were correctly identified an average of 60.04% of the 
time, while among heterogeneous occupations, 61.87% of the in- 
dividuals were correctly classified. This comparison suggested that 
for the techniques investigated, discrimination among similar occu- 
pations was, practically speaking, as accurate as among dissimilar 
groups. 
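
The F ratios in Table 2 can be checked directly from the mean squares. The short sketch below assumes the usual error terms for this design, with occupations random and nested within sets: S is tested against O:S, effects involving D against O×D:S, effects involving M against O×M:S, and effects involving D×M against O×D×M:S. That pairing is an inference from the reported values rather than something stated in the text.

```python
# Mean squares as reported in Table 2.
mean_squares = {
    "S": 1449.86, "O:S": 578.07,
    "D": 126.59, "M": 48.44,
    "SxD": 28.06, "SxM": 85.73,
    "DxM": 625.84, "SxDxM": 76.95,
    "OxD:S": 85.97, "OxM:S": 100.90, "OxDxM:S": 176.51,
}

# Assumed error term for each fixed effect (occupations treated as random).
error_term = {
    "S": "O:S",
    "D": "OxD:S", "SxD": "OxD:S",
    "M": "OxM:S", "SxM": "OxM:S",
    "DxM": "OxDxM:S", "SxDxM": "OxDxM:S",
}

f_ratios = {
    effect: mean_squares[effect] / mean_squares[err]
    for effect, err in error_term.items()
}

for effect, f in f_ratios.items():
    print(f"{effect:6s} F(1, 8) = {f:.2f}")
```

Each test has 1 and 8 degrees of freedom, so with only eight denominator degrees of freedom even the largest ratio, that for D×M, falls short of the .05 level.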

The unweighted average percentage of individuals correctly iden- 
tified when multiple discriminant analysis procedures were used was 
60.62, while non-use of this technique produced 64.18% correct
classifications. This difference was not statistically significant, 
p < .26. A comparison of the measures, lambda vs. chi-square,
showed that for the former an average of 61.29% of the individuals
were correctly classified while for the latter an average of 63.49%
correct classifications were made. The null hypothesis of no difference
between measures was not rejected, p < .52.

An average correct classification rate of 65.67% was made with
the discriminant analysis based on chi-square scores; a 55.56% correct
classification rate was obtained with the discriminant analysis based
on lambda occupational scores; a 61.32% correct classification rate
was obtained based on the chi-square scoring technique used alone;
and a 67.03% correct identification rate was obtained when the
lambda coefficient was used alone. Again the analysis did not
indicate a statistically significant interaction, p < .10. Finally, no
interaction effects with sets, S, were identified.

A further consideration in deciding which scoring technique was
the most effective was consistency in the accuracy with which the
procedures correctly classified individuals across occupations. A
technique which correctly identifies individuals in one or two occupations
at a very high rate but classifies individuals in other
occupational groups at low rates may not, in the long run, be as
valuable as a procedure which consistently classifies individuals at
a moderately high rate over all occupations. The hypothesis of
equal consistency of accuracy across occupations was tested by a
two-way analysis of variance (occupations by scoring procedure).
The dependent variable was the absolute value of the difference
between the average percentage of individuals correctly classified
and the mean average percentage correct classification rate within
each scoring procedure (Levene, 1960). The standard deviation of
the average percentage of individuals correctly classified as well as
the averages of the absolute values of the deviations from the means
within each scoring procedure are presented in Table 3. The null
hypothesis of equal consistency of accuracy among the four scoring
procedures was not rejected for set I. Using the same test with
set II, however, the null hypothesis was rejected. The Scheffé post
hoc test indicated that the chi-square procedure was less consistent
than either the lambda technique or the discriminant procedure
based on chi-square occupational scores.


TABLE 3

The Standard Deviation and Absolute Error Difference for Each Strategy Used in
Levene's Test for Equality of Variance Across the Four Measures

                                Discriminant       Discriminant
                                with Chi-Square    with Lambda    Chi-Square    Lambda

Set I
    Standard Deviation          13.16              15.51          17.87          7.61
    Average Absolute Deviation   9.33              11.57          14.73          5.51

Set II
    Standard Deviation          11.57              13.26          23.40          4.58
    Average Absolute Deviation   8.12              10.75          22.20          3.57
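
The dependent variable of the Levene-type test is simple to compute. As an illustration, the sketch below takes the set I classification rates of two scoring procedures as they can be read from Table 1 and forms the average absolute deviation from each procedure's own mean; the function name is invented here, and the analysis of variance carried out on these deviations is not reproduced.

```python
def mean_absolute_deviation(rates):
    """Average absolute deviation of the rates from their own mean."""
    mean = sum(rates) / len(rates)
    return sum(abs(r - mean) for r in rates) / len(rates)


# Set I percent-correct rates read from Table 1, in the order optometrist,
# x-ray technician, pediatrician, physical therapist, veterinarian.
set_one = {
    "chi_square": [60.10, 43.08, 65.28, 38.06, 88.40],
    "discriminant_lambda": [54.19, 30.61, 44.14, 42.19, 71.98],
}

mad = {name: mean_absolute_deviation(r) for name, r in set_one.items()}
for name, value in mad.items():
    print(f"{name}: {value:.2f}")
```

A smaller average absolute deviation means that a procedure classifies the different occupations at more nearly the same rate.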


Summary 


The results of the present study as presented in Tables 1 and 2 
indicate that no one scoring procedure was superior to the others
across all of the occupational groups considered. The lambda 
weighting procedure, however, correctly identified individuals as 
belonging to their actual occupational group at the highest rate in 
six of ten occupations, the chi-square technique in three of ten, the 
discriminant analysis based on chi-square occupational scores in
one of ten. The discriminant analysis technique based on the lambda 
occupational scores did not have the highest rate of correct classifi- 
cation in any of the occupations studied. In addition, the four 
scoring procedures did not differ on the criterion of consistency of 
accuracy for similar occupations, but the lambda and the discrimi- 
nant analysis based on the chi-square scores procedure were more 
consistent than the chi-square procedure for the dissimilar occupa- 
tions. 

Considering that the discriminant analysis procedures are more 
difficult to calculate and have no greater accuracy than the lambda 
procedure, and since the latter had greater consistency than the 
chi-square technique, the lambda procedure seems preferable. Fur- 
thermore, the lambda coefficient is also preferable to the chi-square 
procedure since lambda weights are not a function of the occupations
being compared, while the chi-square weights for an occupation
depend on the other occupations in the set being compared.
The problem of selecting the size of the tails from a sample
drawn from a distribution of test scores for item analysis study
is discussed. In the past the selected tails were viewed as independent
samples. However, this is not the case. The tails are
dependent. Given this we find that each tail should contain about
21% of the sample and not the traditional 27%. Use of 27%,
however, is not far from optimal.


In the traditional item analysis situation a test is administered 
to N subjects, the resulting scores are arranged in order of magnitude 
and the tests of those subjects whose scores fell in either the upper
or lower tails of the distribution of test scores are further studied
item by item. A question that arises here is how large should the
tails be; that is, what is the appropriate q (0 < q < .50) such
that the upper and lower tails to be selected for further study each
contain q of the cases?

An argument for determining q runs as follows. Assume the
underlying distribution of test scores is normal and the sample
size is large. Let $\bar{X}_U$ and $\bar{X}_L$ be the means from the upper and lower
tails of the sample respectively. Then q should be determined so
that the critical ratio

$$CR = \frac{\bar{X}_U - \bar{X}_L}{SE(\bar{X}_U - \bar{X}_L)} \tag{1}$$

is maximized. Here $SE(\bar{X}_U - \bar{X}_L)$ is the standard error of $\bar{X}_U - \bar{X}_L$.


Now CR of (1) will vary from sample to sample due to sampling 
fluctuations. So it is more appropriate to first take expectations
and then maximize. This means for large samples the maximizing of 


= @/ 

BOR) = эў, +) % 
where f is the unit normal ordinate at the upper standardized
baseline point z which separates the tail from the rest of the
distribution and $\sigma$ is the standard deviation of the underlying distribution
of test scores. That is, z and f are defined by

$$q = \int_z^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt \tag{3}$$

and

$$f = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}. \tag{4}$$


In previous works (Cureton [1957] and Kelley [1939]) it was implicitly
assumed that the upper and lower tails constitute independent
subgroups. However, this is not the case. The upper tail
observations are correlated with the lower tail observations even for large
samples (Mosteller, 1946). Because of this the standard error of
$\bar{X}_U - \bar{X}_L$ displayed in the above references is incorrect and the q
which maximizes the critical ratio is not .27 as was previously
believed. As we show below it is about .21.

Using the method of Chernoff, Gastwirth and Johns (1967) the
correct standard error can be shown to equal, asymptotically (i.e.,
for large samples),

$$SE(\bar{X}_U - \bar{X}_L) = \frac{\sigma \sqrt{A}}{q \sqrt{N}} \tag{5}$$

where

$$A = 2q + 2qz^2(1 - 2q) - 2zf(1 - 4q) - 4f^2. \tag{6}$$

The term to maximize is thus

$$E(CR) = \frac{2f\sqrt{N}}{\sqrt{A}} \tag{7}$$

or equivalently the term to maximize is

$$\frac{4f^2}{A}. \tag{8}$$

This last term was computed for q = .01(.01).50. It was found to
attain its maximum around q = .21 and .22. In particular some
entries are


    q       $4f^2/A$
    .19     1.9293
    .20     1.9325
    .21     1.9338
    .22     1.9336
    .23     1.9320
    .27     1.9147
    .50     1.7519


Using four-point interpolation, (8) attains its maximum for q to
three decimal places at q = .215 and here $4f^2/A$ = 1.9339. The
optimal q is actually slightly closer (i.e., consideration of fourth
decimal place) to .21 than .22. Thus instead of the 27% rule, more
appropriately it is the 21% rule. However, as is clear from the
above entries little is lost by using q = .27.
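
The tabled entries can be reproduced numerically. The sketch below assumes that A has the form $A = 2q + 2qz^2(1 - 2q) - 2zf(1 - 4q) - 4f^2$, an expression reconstructed here so as to agree with the tabled values and not necessarily written the way the authors printed it. It uses only the Python standard library, and the function name `efficiency` is an invention of this example.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # unit normal distribution


def efficiency(q):
    """Return 4 f^2 / A for a tail proportion q, 0 < q <= .5."""
    z = STD_NORMAL.inv_cdf(1.0 - q)  # baseline point cutting off the upper tail
    f = STD_NORMAL.pdf(z)            # unit normal ordinate at z
    a = 2*q + 2*q*z*z*(1 - 2*q) - 2*z*f*(1 - 4*q) - 4*f*f
    return 4 * f * f / a


# Evaluate on the grid q = .01(.01).50 used in the text.
grid = [k / 100 for k in range(1, 51)]
values = {q: efficiency(q) for q in grid}
best_q = max(values, key=values.get)

for q in (0.19, 0.20, 0.21, 0.22, 0.23, 0.27, 0.50):
    print(f"{q:.2f}  {values[q]:.4f}")
print("grid maximum at q =", best_q)
```

The curve is very flat near its maximum, which is why the loss from using q = .27 is small.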

The arguments of Cureton (1957) and Kelley (1939) would be 
correct if instead of selecting the tails of a sample, two independent 
samples were drawn from the tails of the complete distribution of 
test scores. This, of course, is usually impossible for the actual 
distribution is unknown. The best that is possible would be to select 
independent samples from the tails of some concomitant variable 
where we can view the test scores as a dependent measure. This 
approach is exactly the extreme group approach discussed by Feldt 
(1961). In this case the optimal tail size is around .27 if the correla- 
tion between the concomitant variable and the test scores is small 
(i.e., around .10). Under normality this is implying independence.
As the correlation increases the optimal tail size decreases. From 
above it appears to follow that if the concomitant variable and the 
test scores have correlation one then the optimal tail size is around
.215. 

We should also mention that the results of Ross and Weitzman 
(1964) do not disagree with our results. For they show that the 
optimal size of the tails (under bivariate normality) for estimating 
the tetrachoric correlation is .27 when again the variables are in 
fact uncorrelated. The optimal tail sizes differ when the correlation 
is not zero.

Finally, it should be stressed that the above demonstration raises 
a serious question regarding the techniques presently employed for
item discrimination analysis based on answers to an item in high
and low criterion groups. The techniques are the chi-square test
in a four-fold table or the two-sample t test. The present tests
assume two independent groups. As indicated above this is not so.
Either the validity of the tests needs to be demonstrated or new
techniques taking into account the dependency need to be produced.
The present authors are working towards these.


REFERENCES 


Chernoff, H., Gastwirth, J. L. and Johns, M. V. Asymptotic distribution
of linear combinations of order statistics. The Annals of
Mathematical Statistics, 1967, 38, 52-72.

Cureton, E. E. The upper and lower twenty-seven percent rule.
Psychometrika, 1957, 22, 293-296.

Feldt, L. S. The use of extreme groups to test for the presence of a
relationship. Psychometrika, 1961, 26, 307-316.

Kelley, T. L. The selection of upper and lower groups for the validation
of test items. Journal of Educational Psychology, 1939, 30.

Mosteller, F. On some useful “inefficient” statistics. The Annals of
Mathematical Statistics, 1946, 17, 377-408.

Ross, J. and Weitzman, R. A. The twenty-seven percent rule. The
Annals of Mathematical Statistics, 1964, 35, 214-221.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 51-66. 


A MODEL FOR PSYCHOMETRICALLY DISTINGUISHING 
APTITUDE FROM ABILITY 


SUSAN E. WHITELY AND RENÉ V. DAWIS¹
University of Minnesota 


It is widely agreed that current ability measures reflect a com- 
plex interaction of environment with genetic potential. This leads 
to a basic measurement problem since persons with the same
measured ability may vary widely in potential due to nonequivalent 
learning opportunities. The purpose of this paper is to present a 
model which may hold some promise in psychometrically distin- 
guishing ability (current status) from aptitude (potential). Data 
on spatial reversal performance are analyzed according to the 
model to illustrate how some of the practical problems may be 
solved.


It is widely agreed that current ability measures reflect a complex
interaction of environmental and genetic factors. The literature on
general intelligence, for instance, has unequivocally demonstrated
that test performance is highly influenced by membership in a
culture or subculture such as a race or socio-economic class. Although
there still is much controversy over whether these subcultures
differ genetically (Herrnstein, 1971; Jensen, 1969), it is
known that exposure to more advantageous environments can increase
IQ (cf. Lee, 1951). Thus, with general intelligence (and
probably many other abilities) the particular learning experiences
and opportunities an individual encounters have a significant influence
on his measured ability. This leads to a basic measurement
paradox since individuals who show the same measured ability may
have different potential if their learning experiences have varied
widely. The problem is to find a method of determining which
individuals have undeveloped potential so that they may be provided
more effective educational experiences or subjected to more
appropriate selection criteria.

The major purpose of this paper is to present a method which
psychometrically distinguishes between ability (current status on a
test) and aptitude (potential). Tests measuring cognitive ability
factors have not, as yet, been developed to the point where the
general utility of this approach can be assessed. However, data on
a psychomotor ability will be presented to illustrate the feasibility
of the approach and to suggest ways to solve some of the more
practical problems in the application of the technique.


A Conceptualization of the Relationship of Ability to Aptitude


At the conceptual level, it is suggested that ability can be defined
as current status and aptitude can be defined as potential status
under conditions optimally favorable to the development of the
ability. Direct assessment of aptitude, then, would imply equally
favorable learning experiences for all individuals. Obviously, this
is not the case, but it is theoretically possible to distinguish between
individuals having the same ability but different aptitudes by
directly measuring the modifiability of ability. That is, when two
individuals with the same current level of ability are given equivalent
intervention (e.g., practice), the individual with the greater
aptitude should show a faster rate of change in ability than the
individual with lesser aptitude.

It is important to consider this general conceptualization of
modifiability in the context of previous research, since
many investigators have found the measurement of change to be
not only complex, but of very questionable utility.
Woodrow's (1939) finding that improvement over practice is not
the same as learning ability has been largely unrefuted by subsequent
investigations (see Cronbach and Snow, 1969, for summary).
Woodrow found gain to be both task-specific and not correlated
with a general ability factor. Since gain scores do not seem
to reflect learning ability, it would appear that measuring modifiability
would not be a useful technique in discriminating aptitude
from ability.

However, this conclusion may be premature because correlation
with gain scores apparently depends on two important variables.
First, the stage of practice over which gain is computed may mod- 
erate the relationship between gain score and other variables. Both 
Woodrow (1939) and more recent studies (e.g., Dunham, Guilford 
and Hoepfner, 1968) have found that the factorial composition of
a task may change over stages of practice. Gain at different stages
of practice, then, may be expected to depend on different abilities. 
It is interesting to interpret the Woodrow (1939) study from this 
perspective. Woodrow used 66 trials for four practiced tests and
found the final scores to be less correlated with cognitive factors 
than the initial scores. With long practice periods, subjects ap- 
proach the mastery level and, as Jones (1970a) has noted, the 
variation among subjects at this level becomes increasingly due to 
error (unreliability). Both unreliability and motor ability may have 
accounted for a large share of the final task variation among 
Woodrow's subjects. More importantly, gain computed over such 
a long interval of practice could be expected to reflect task-specific
mastery components rather than a more general learning ability. 

A second important moderator of correlation with gain scores is 
the relationship of gain to initial status. Gain correlates negatively 
with initial status in most learning tasks, so that those with the 
least efficient initial performance gain the most. Variability in gain 
scores, then, reflects two confounding factors, stage of practice 
and initial status. 

A successful distinction between aptitude and ability will have 
to account for these two important variables. Stages of practice 
should be carefully studied so that gain can be measured during 
those stages most likely to reflect learning ability. Also, the con- 
founding effect of initial status should be controlled. The model 
to be proposed here directly controls the influence of initial status 
and uses a special technique, molar correlation analysis (Jones, 
1962; 1970a; 1970b), to investigate stages of practice. 

It is hypothesized that aptitude can be reflected by a linear com- 
bination of two measures, ability (current status) and modifiability. 
The simplest way of expressing this relationship is by the equation 
for a straight line, y = a + bx. The symbols in the equation are
defined as follows: (1) the constant, a, is the initial status on the
ability test; (2) bx is the modifiability of the ability test scores;
and (3) y is the aptitude when measured ability is at asymptotic
value. Modifiability has two separate components. One of these,
x, refers to either the graded quality or amount of intervention between
ability estimates, while b refers to the rate of ability change
observed. An individual's aptitude, then, is conceived of as a fluid
quantity, characterized by both his initial status, a, and sensitivity
to intervention, b.

It follows, then, that in order to have a measurement which re- 
flects aptitude, it is necessary to have at least two measurements 
of ability, one before and one after a standardized intervention 
(fixed value of x). For prediction, ability and modifiability would 
be two different independent variables in a multiple regression 
equation, weighted according to their relative importance to the 
criterion to be predicted. 

This simple formulation, with the hypothesized role of gain as 
a predictor, has interesting implications. Cronbach and Furby 
(1970) have suggested that “residualized gain scores” rather than 
raw gain should be used if the goal is to identify individuals with 
undeveloped potential. Thus, only “unexpected” gain would be used 
to select such individuals. When raw gain is used in a regression 
equation with initial status, it can be shown that the beta weight 
for raw gain is linearly related to the correlation between resid-
ualized gain and the criterion.3
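
As a numerical check on this claim, the short Python sketch below applies the standard two-predictor formulas for a standardized regression weight and a part (semipartial) correlation. The correlation values are hypothetical and chosen only for illustration; the subscripts follow the footnote (1 = criterion, 2 = initial status, 3 = raw gain).

```python
import math


def beta_for_gain(r12: float, r13: float, r23: float) -> float:
    """Standardized weight for raw gain (3) when the criterion (1) is
    regressed on initial status (2) and raw gain (3) together."""
    return (r13 - r12 * r23) / (1.0 - r23 ** 2)


def part_corr_residualized_gain(r12: float, r13: float, r23: float) -> float:
    """Correlation of the criterion with gain residualized on initial status."""
    return (r13 - r12 * r23) / math.sqrt(1.0 - r23 ** 2)


# Hypothetical correlations (assumed values, not data from the study).
r12, r13, r23 = 0.40, 0.10, -0.50

beta3 = beta_for_gain(r12, r13, r23)
part = part_corr_residualized_gain(r12, r13, r23)

# The ratio depends only on r23, so for a fixed correlation between gain
# and initial status the beta weight is a linear function of the part
# correlation.
ratio = beta3 / part
print(round(beta3, 4), round(part, 4), round(ratio, 4))
```

The ratio of the two quantities is 1 / sqrt(1 - r23^2), whatever values are chosen for r12 and r13.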

Similarly, the use of a regression equation also implies that raw 
gain need not be correlated with a criterion measure to yield in- 
creased predictability. Gain may function as a suppressor variable 
through its correlation with initial status. Theoretically then, gain 
as a modifiability measure can be used to correct initial status 
scores to provide a better reflection of aptitude. 
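
The suppressor effect can be illustrated with a small sketch. The correlations below are hypothetical: gain is assumed to be uncorrelated with the criterion but negatively correlated with initial status, and the squared multiple correlation is computed from the usual two-predictor formula.

```python
def r_squared_status_and_gain(r12: float, r13: float, r23: float) -> float:
    """Squared multiple correlation of the criterion (1) with
    initial status (2) and raw gain (3)."""
    return (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)


# Hypothetical values chosen for illustration only.
r12 = 0.50   # criterion with initial status
r13 = 0.00   # criterion with raw gain: no direct relationship
r23 = -0.50  # initial status with raw gain

r2_status_only = r12 ** 2
r2_status_plus_gain = r_squared_status_and_gain(r12, r13, r23)
increment = r2_status_plus_gain - r2_status_only
print(round(r2_status_only, 3), round(r2_status_plus_gain, 3), round(increment, 3))
```

Under these assumed values the variance accounted for rises from one quarter to one third, even though gain by itself predicts nothing.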


3 Consider the following regression equation:

    \hat{z}_1 = \beta_2 z_2 + \beta_3 z_3,

where 1 refers to the criterion deviation score, 2 refers to the deviation on initial
status and 3 refers to the deviation of the raw gain score. The beta weight for
raw gain can be computed as follows:

    \beta_3 = \frac{r_{13} - r_{12} r_{23}}{1 - r_{23}^2},

and the part correlation of the criterion and raw gain, removing the effects of
initial status (definitionally the correlation of the residualized gain score and the
criterion) is as follows:

    r_{1(3 \cdot 2)} = \frac{r_{13} - r_{12} r_{23}}{\sqrt{1 - r_{23}^2}}.

Then

    \beta_3 = \frac{r_{1(3 \cdot 2)}}{\sqrt{1 - r_{23}^2}}.

There are several problems which directly touch on the feasibility
of utilizing such an approach to the measurement of aptitude. The 
first, and most obvious, concerns the extent to which ability scores 
can show gains from very short intervention periods. Previous 
studies on coaching (see Anastasi, 1958, for summary) indicate that 
large gains can be made, and that there are individual differences 
with respect to gains. Apparently, students from the most deficient 
environments show the largest gains. It is not clear, however, 
whether this is due to the correlation of gain with initial status 
or if there are also larger unexpected gains for such a group. 

A second problem concerns the degree to which modifiability of
the test score parallels the latent ability trait. Basically this ques-
tion concerns the relationship between the asymptotic value ob-
tained in the prediction and the latent aptitude. Practically speak-
ing, in the long run, the degree of correspondence here will be
determined by the extent to which modifiability scores lead to in-
creased predictability of achievement. However, in the short run,
there is a design problem with respect to the degree and nature of
the standardized intervention. For instance, little correspondence
between latent aptitude and asymptotic test score would be ex-
pected when the intervention utilizes the same items that are used
for final testing. The asymptotic test score would then depend more
on rote memory than on aptitude. Similarly, measuring gain over
long intervention periods would be expected to be less correspondent
to latent trait modifiability and more specific to the predictor.

A set of related problems concerns the kind of rate measure to 
use to provide increased reflection of aptitude. The most critical 
of these problems concerns the relationship of rate measurements 
to the true shape of the individual ability curves. Most likely, this 
curve is S-shaped such that slope between any two points varies 
over the course of intervention. If initial status is near the bottom 
of this curve (large undeveloped potential), the instantaneous rate 
(derivative) will start out at a low level and then increase to a 
maximum rate, followed by a decrease in rate until asymptotic
value is reached. Thus, it is not necessarily the case that for in- 
dividuals at the same initial status, the one with the highest ap- 
titude will have the highest modifiability. It depends on what point 
of the curve is being observed. 

Figure 1 presents ability curves and observed rates of change for
two hypothetical individuals. Two individuals may have the same
observed rate of change (observed between two distant points
over intervention) when one has a decreasing instantaneous rate
and the other an increasing rate. The one with the increasing rate
(Person 2) will reach the higher asymptotic level, but this will
not be detectable by gain between these two points.

[Figure 1. Hypothetical ability curves and observed rates of change for
two individuals. Horizontal axis: Degree of Intervention; vertical axis:
Measured Ability. Curves for Person 1 and Person 2 show true change over
intervention; dashed lines show the observed rate.]
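
The situation in Figure 1 can be sketched numerically. The example below assumes, purely for illustration, that each ability curve is logistic; the asymptotes and the two test scores are hypothetical values, and the function names are not taken from the source. Both persons are given identical scores at the two testing points, so their observed gains are equal.

```python
import math


def logit(p: float) -> float:
    return math.log(p / (1.0 - p))


def fit_logistic(asymptote, x0, y0, x1, y1):
    """Slope k and midpoint m of y = asymptote / (1 + exp(-k (x - m)))
    for the curve passing through (x0, y0) and (x1, y1)."""
    k = (logit(y1 / asymptote) - logit(y0 / asymptote)) / (x1 - x0)
    m = x0 - logit(y0 / asymptote) / k
    return k, m


def ability(x, asymptote, k, m):
    return asymptote / (1.0 + math.exp(-k * (x - m)))


def instantaneous_rate(x, asymptote, k, m):
    """Derivative of the logistic curve at x."""
    y = ability(x, asymptote, k, m)
    return k * y * (1.0 - y / asymptote)


# Hypothetical scores: both persons obtain 2.0 at Test 1 and 2.6 at Test 2.
x0, y0, x1, y1 = 0.0, 2.0, 1.0, 2.6
asymptote_1, asymptote_2 = 3.0, 10.0   # assumed asymptotic levels

k1, m1 = fit_logistic(asymptote_1, x0, y0, x1, y1)
k2, m2 = fit_logistic(asymptote_2, x0, y0, x1, y1)

gain_1 = ability(x1, asymptote_1, k1, m1) - ability(x0, asymptote_1, k1, m1)
gain_2 = ability(x1, asymptote_2, k2, m2) - ability(x0, asymptote_2, k2, m2)

# Instantaneous rates at Test 1 and Test 2 for each person.
rates_1 = (instantaneous_rate(x0, asymptote_1, k1, m1),
           instantaneous_rate(x1, asymptote_1, k1, m1))
rates_2 = (instantaneous_rate(x0, asymptote_2, k2, m2),
           instantaneous_rate(x1, asymptote_2, k2, m2))
print(gain_1, gain_2, rates_1, rates_2)
```

Person 1 is already past the inflection point and is decelerating, while Person 2 is still accelerating toward a much higher asymptote, yet the gain between the two tests is the same for both.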

There are at least two possible approaches to this problem. One
is to take several slope (rate) measurements over increasingly better
interventions. This may be impractical because it is time-consuming
for complex abilities and may pose insurmountable difficulties with
respect to precision of measurement of ability scores. A second
approach is to use only one segment of intervention, but to select
this intervention segment such that the reflection of aptitude by
modifiability is maximized in the measurements. In the section
that follows, molar correlation analysis (Jones, 1962; 1970a) is
used on psychomotor ability data to demonstrate how this selection
may be maximized in a population of individuals.

A final problem which may affect the utility of the proposed
model concerns how rate is to be computed. If a single intervention 
session is used, i.e., observing ability only twice, there is no obvious 
unit to use on the abscissa. When rate is to be used in a regression 
equation with initial status, there are three possible ways of com- 
puting it: (1) slope, (2) score ratio and (3) gain score. To com- 
pute the first rate index, slope, some measurement of performance 
must be taken during intervention. It may be feasible to generate 
such measures, such as time spent in practice, but the interpreta- 
bility is not always clear. The second index, score ratio, can be used 
on tests that provide ratio scale measurement. Although no cur-
rently existing ability test provides ratio scale measurement, there
are new scaling methods (e.g., Wright and Panchapakesan, 1969) 
which can potentially allow the computation of score ratios. The 
third possible index to use is raw gain score. Since the gain score 
is to be used in a regression equation, as proposed above, many of 
the objections to raw gain score are eliminated. 

However, no matter which index of rate is used, the reliability 
of these measurements from equivalent forms should be investi- 
gated during the development of the tests. To depend on tests de- 
veloped according to classical criteria of equivalent forms leads 
to paradoxes in the estimation of the reliability of rate scores. 
That is, gain is not independent of measurement error. 

Information on the general utility of this approach on complex 
cognitive abilities apparently must await further developments. 
However, data on a psychomotor ability are presented below to 
illustrate the relationship of modifiability to prediction and to sug-
gest internal criteria for the selection of an appropriate segment
of intervention. 


The Predictability of Spatial Reversal Performance 


Materials 

A task which requires spatial reversal ability, tracing a simple 
figure in a mirror-blind apparatus, was used to provide ability 
status and modifiability measurements. In the mirror-blind ap-
paratus, the only visual cues are completely reversed from normal
eye-hand coordination tasks. This task has been shown to be highly 
influenced by experience, although individual differences do persist 
(P. W. Fox, personal communication). 

The figure to be traced for the predictor (status and modifi- 
ability) measurements was a “zig-zag” line that required reversals 
in only two different directions. Both reversals were at 45 degree 
angles. The criterion task to be predicted by these status and 
modifiability measurements was the tracing of a more complex 
figure, a six-pointed star, in the mirror-blind apparatus. The star 
was constructed such that the role of spatial reversal ability would 
be maximized and task overlap between the zig-zag line and the 
star with respect to specific reversals would be minimized. On the 
star, no two reversals were in the same direction at the same angle. 
Also, none of the reversals on the star was in the same direction as 
on the zig-zag line. To equate the role of motor speed on these tasks, 
the star and zig-zag line were equated for total number of reversals 
and distance between reversals. The resulting correlations between 
the predictor measurements on the zig-zag line and criterion mea- 
surements on the star should then be due to motor reversal ability. 
The general question asked of the data, then, is: does modifiability 
on a specific measure of an ability (similar to coaching on a test
with homogeneous items) add anything to the prediction of a com- 
plex task assumed to load heavily on the ability? If so, then 
modifiability on the zig-zag line should add to predictability on 
the star, in the mirror-blind task. 


Subjects and Procedure 


The subjects were 49 college sophomores enrolled in elementary 
psychology courses at the University of Minnesota. Four subjects 
were dropped from the experiment: two because of equipment 
failure, one for exceeding the five minute time limit, and one for 
taking a drug known to influence psychomotor performance. 
Each subject was given 10 successive trials on tracing the zig-zag
line in the mirror-blind apparatus. Immediately following these 
trials, the star was traced for one trial in the mirror-blind apparatus. 
Time, in seconds, was recorded for each trial. High scores, on both 
predictor and criterion, indicate inefficient performance.


Results 


Table 1 presents the means, standard deviations, and correlations 
between trials of the spatial reversal task on the zig-zag line. It
can be seen that both the mean number of seconds to complete the
task and the variability decrease over trials. The correlations be-
tween status on the predictor trials and the criterion are also pre-
sented on Table 1. All correlations were significant (p < .05) except
for Trial 1 (p = .06). The highest correlation for these status mea- 
surements was at Trial 4 (.52). 

An inspection of the intertrial correlations presented on Table 1 
shows that the correlations display superdiagonal form (Jones, 
1962). That is, as one moves down the columns of the correlations 
matrix or across rows to the left, the correlations increase in size. 
Adjacent measurements of reversal performance, then, correlate 
more highly than remote ones. Jones (1970b) has found this pat- 
tern to be the general rule over trials of practice, with the excep- 
tion of very simple psychomotor tasks. 
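
A check for superdiagonal form can be sketched as follows. The function name and the example matrices are hypothetical; the test simply verifies that, above the main diagonal, no correlation exceeds its neighbor one step closer to the diagonal.

```python
def is_superdiagonal(r, tol=0.0):
    """True if correlations never decrease when moving toward the main
    diagonal, either down a column or leftward along a row."""
    n = len(r)
    for i in range(n):
        for j in range(i + 1, n):
            # One step down the column (closer to the diagonal).
            if i + 1 < j and r[i + 1][j] + tol < r[i][j]:
                return False
            # One step left along the row (closer to the diagonal).
            if j - 1 > i and r[i][j - 1] + tol < r[i][j]:
                return False
    return True


def simplex(n, base):
    """Hypothetical intertrial matrix with r_ij = base ** |i - j|."""
    return [[base ** abs(i - j) for j in range(n)] for i in range(n)]


ordered = simplex(5, 0.9)

# Make the two most remote trials correlate more highly than adjacent ones.
scrambled = [row[:] for row in ordered]
scrambled[0][4] = scrambled[4][0] = 0.95

print(is_superdiagonal(ordered), is_superdiagonal(scrambled))
```

With sample data a small positive `tol` may be preferable, since sampling error alone can produce minor reversals.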

Table 2 presents a decomposition of the total correlation matrix
of the ten predictor trials into rate and terminal process com- 
ponents, as suggested by Jones (1970a). Jones hypothesizes that 
for intertrial correlation matrices having superdiagonal form, the 
consistency of performance over trials is due to some combination
of a rate and a terminal process. The terminal process is defined as
the relative ordering of subjects when all have reached their 
terminal positions. The extent to which the rate process exists 
between trials indicates the extent to which individual differences 
in rate of change are contributing to the consistency of performance. 
The rate process is defined as the residual correlation between trials 
after partialing out the correlations due to terminal position. Jones
(1970a) suggests that the rate process is usually strongest during
the early stages of practice and gradually decreases as the terminal
process takes over in later stages of practice. Since true asymptote
is never reached, Jones suggests that the last trial in the matrix
be used to estimate terminal position.


TABLE 2

Decomposition of the Intertrial Correlations into Rate and Terminal Process
Components

Trial 1 2 3 4 5 6 7 8 9

Note. The terminal process appears above the main diagonal; the rate process below it.
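
One plausible reading of this decomposition can be sketched in a few lines. It is assumed here, and not confirmed by the source, that the terminal component of the correlation between two trials is the product of their correlations with the last trial, and that the rate component is the residual. Jones's (1970a) exact procedure may differ. The example matrix is hypothetical.

```python
def decompose_intertrial(r):
    """Split each intertrial correlation into a terminal and a rate part.

    r is the full symmetric intertrial correlation matrix; its last trial
    is taken as the estimate of terminal position. Only off-diagonal
    entries of the returned matrices are meaningful.
    """
    last = len(r) - 1
    terminal = [[r[i][last] * r[j][last] for j in range(last)]
                for i in range(last)]
    rate = [[r[i][j] - terminal[i][j] for j in range(last)]
            for i in range(last)]
    return rate, terminal


# Hypothetical four-trial matrix with r_ij = 0.9 ** |i - j|.
r = [[0.9 ** abs(i - j) for j in range(4)] for i in range(4)]
rate, terminal = decompose_intertrial(r)

# Rate component between Trials 1 and 2, and between Trials 2 and 3.
print(round(rate[0][1], 5), round(rate[1][2], 5))
```

In this assumed example the rate component is larger between the early trials than between the later ones, which is the pattern described in the text.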

In Table 2, the part of the intertrial correlation due to terminal 
position (Trial 10), appears above the diagonal, while the rate 
process appears below it. It can be seen that the terminal process
is stronger late in practice, by the increase of correlations moving 
down the columns and across the rows. The rate process, on the 
other hand, is strong before Trial 4, but then becomes small and 
irregular after this point. Thus, the patterns of performance indi- 
cate that there are consistent differences between subjects before 
Trial 4, which are independent of their terminal positions. This
indicates that subjects are changing at different rates. The terminal 
process starts at Trial 4, but stays at a constant strength until 
Trial 7. At Trial 7, the terminal process again begins to increase. 

In Table 2, the triangles enclose what appear to be different 
stages of practice for the series of trials. Moving down the main 
diagonal, the correlations in the first triangle on either side mark 
the termination of the rate process. The second set of triangles 
designate intermediate trials in which consistency is mainly due to 
terminal process, but the terminal process is not increasing with 
practice. This is a somewhat unusual stage of practice since termi- 
nal process usually increases regularly. The third set of triangles 
marks late stages of practice in which the terminal process in- 
creasingly determines consistency between trials. 

The influence of stages of practice in determining the function 
of gain in predicting a criterion was investigated by computing 
the patterns of predictability from gain and initial status over the 
total trial matrix. With the exception of Trial 1, which did not 
correlate with the criterion, and Trial 10, the last trial, each trial 
was treated as an initial status measure. Gain was computed 
separately between each initial status trial and each succeeding 
trial of practice.
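
The procedure just described can be sketched as follows. The data are invented for illustration (six subjects, four trials), the function names are hypothetical, and the increment in variance accounted for is obtained from the standard two-predictor formula for the squared multiple correlation.

```python
import math


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def increment_from_gain(initial, later, criterion):
    """Increase in R^2 when raw gain is added to initial status."""
    gain = [a - b for a, b in zip(initial, later)]  # seconds saved
    r12 = pearson(criterion, initial)
    r13 = pearson(criterion, gain)
    r23 = pearson(initial, gain)
    r2_full = (r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2)
    return r2_full - r12 ** 2


def increment_table(trials, criterion):
    """trials[t] holds every subject's time on trial t + 1. Trial 1 and
    the last trial are not used as initial status, following the text."""
    n_trials = len(trials)
    table = {}
    for i in range(2, n_trials):              # initial-status trial (1-based)
        for j in range(i + 1, n_trials + 1):  # later trial (1-based)
            table[(i, j)] = increment_from_gain(
                trials[i - 1], trials[j - 1], criterion)
    return table


# Invented times in seconds; not data from the study.
trials = [
    [60, 55, 70, 48, 65, 52],
    [50, 41, 62, 40, 49, 45],
    [41, 35, 50, 36, 38, 40],
    [38, 30, 44, 33, 31, 37],
]
criterion = [80, 62, 95, 70, 66, 77]

table = increment_table(trials, criterion)
for pair in sorted(table):
    print(pair, round(table[pair], 3))
```

Each entry of the resulting table is the squared part correlation between the criterion and gain, so it cannot be negative apart from rounding error.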

Table 3 presents the increased percentage of variance accounted 
for in the criterion when gain score is added to initial status in a
regression equation. It can be seen that the highest increment in 
prediction occurs when Trial 2 is initial status and gain is computed 
between Trial 2 and Trial 4. The multiple R, not reported in Table 
3, was .52. Gain between Trial 3 and Trial 4, using Trial 3 as initial
status, also increased prediction significantly at the .01 level.
Change between Trial 2 and Trial 3, both within the rate
process, added nothing to predictability.


TABLE 3

Change in Percentage of Variance Accounted for by the Addition of Gain Scores in
the Regression Equations

Gain from                     Gain to Trial
Trial        2     3     4       5     6     7     8     9      10
2           --    .06   .19**   .03   .07   .08   .03   .11*   [illegible]
3                 --    .14**   .00   .02   .03   .00   .05    .08*


Change occurring late in practice, from Trial 8 and beyond, also
yields significant beta weights for gain scores. It can be seen from 
Table 3 that all gain computed beyond Trial 8 yielded significantly 
increased predictability. It is interesting to note, however, that gain 
from the rate process trials, Trial 2 and Trial 3, to any of the inter- 
mediate trials, Trial 5, Trial 6, Trial 7, and Trial 8, did not produce 
any increased predictability. However, when gain is computed to 
Trial 9 or Trial 10, significant beta weights for gain are seen again. 

The reasons for the lack of increased predictability in the inter- 
mediate phase are not entirely clear. It is possible that the spatial 
reversal task is characterized by two psychologically distinct phases 
which are masked in the intermediate trials. To determine the 
plausibility of this interpretation, the gain scores between each trial 
and each succeeding trial were correlated with the criterion. With 
one important series of exceptions, gain did not correlate with the 
criterion so that the increase of predictability could only be derived 
with gain as a suppressor variable. However, gain from Trial 4 
to each succeeding trial did correlate significantly (r's = .39, .3[?],
.40, .49, .45, .45; p's < .01), indicating that gain after Trial 4 was
associated with less efficient performance on the criterion. Ap- 
parently, in the intermediate trials, some persons were still master- 
ing the skills from the first stage of practice, while others were 
already headed toward mastery of the task. Gain in the inter-
mediate stages did not reflect the differential aptitude of the
learners in the spatial reversal task. 

To explore the possibility that there may be population differences
with respect to the improvement of prediction using gain scores,
the subjects were sorted into two groups. Subjects whose criterion
performance could be predicted better when gains were added to
initial status formed one group (N = 27) and subjects for whom
gain either made no difference or decreased predictability formed
the other group (N = 18). Trial means and variances were com-
puted separately for the two groups. None of the comparisons
between means and variances reached significance at the .05 level.


Discussion

The most striking finding from the spatial reversal data is that
the utility of gain in prediction is highly dependent on the stage 
of practice over which gain is computed. Gain was found to increase 
the predictability of the criterion task significantly when computed 
between early trials or between the late trials, but added nothing 
when computed over the intermediate trials. There was some indica- 
tion that two distinct processes may be confounding the meaning- 
fulness of gain during the intermediate trials because gain after 
the termination of the rate process had a direct rather than a sup- 
pressor relationship with the criterion. These results for the inter- 
mediate trials are consistent with previous research indicating that 
different processes are involved in different stages of practice 
(Dunham et al., 1968), and with the expectation that observed gain 
between two points does not necessarily parallel asymptotic level 
of aptitude. A simple rate measure does not differentiate an indi- 
vidual with extremely underdeveloped potential from an individual 
whose potential is only slightly underdeveloped because of the 
impossibility of determining if the individual is at an ascending or 
descending rate phase on his/her hypothetical ability curve.

The results also indicate the potentiality of molar correlation 
analysis to select the optimum measurement points for gain by 
internal criteria. The pattern of rate and terminal processes govern-
ing intertrial consistency led in this study to the designation of
three distinct stages of practice. Theoretically, the maximum con- 
tribution of gain to predictability would be from a stage in which 
there are individual differences in rate of change which are not 
accounted for by their asymptotic level, to a stage which reflects 
the asymptotic level. Gain from trials with a strong rate process, 
Trial 2 and Trial 3, to the beginning of the terminal process, yielded 
large and significant increases in predictability. Less clearly an- 
ticipated, but mirrored by the patterns of intertrial consistency, was 
the predictability between late trials but not between intermediate 
trials.

Differences in trial means and variability were not found between
subjects differing with respect to the improvement of predictability 
by the addition of gain scores. Very likely, the factors that will 
differentiate the group which increased in predictability by using 
gain scores will be extrinsic variables, such as age, race, and SES,
rather than individual differences intrinsic to status measurement. 


Conclusion 


The spatial reversal data indicate that the approach suggested 
here to distinguish aptitude from ability psychometrically has some 
feasibility. It was shown that predictability could be increased by 
adding an index of modifiability to initial status from a test- 
intervention-retest sequence. However, the data also indicate the 
crucial importance of the stage of practice or degree of intervention 
over which modifiability is computed. The degree of success with 
which aptitude can be distinguished from ability depends directly 
on the intervention or amount of practice which is selected. Equally 
important, but less obvious, is the probable influence of the degree 
of intervention in determining which population will show the most 
modifiability.

As shown in the spatial reversal data, there may be more than 
one stage of practice or degree of intervention which will provide 
increased predictability. Although not tested in this study, it is
likely that different populations will be favored by modifiability 
measures taken from different stages of practice. If sub-cultures
can be said to differ with respect to favorableness of the environ- 
ment to the development of a given ability, then the average initial 
status points of these populations on their aptitude-ability curves 
will vary. One population may be at a very low point on this curve 
due to an extremely disadvantageous environment while another 
is at a mid-range point. The difficulty is, as previously discussed, 
that the average rate of change between two points does not reflect 
whether instantaneous rate is increasing or decreasing. Thus, if a
low degree of intervention is chosen, the disadvantaged population 
may show a slow rate of change. There would be no way of knowing 
if this were due to being at the end of the curve (where rate de- 
creases) or at the beginning (where rate increases). 

In applying this aptitude-ability approach to complex tests, there 
are some other issues that must be resolved. These issues involve the
reliability of a gain score. Many of the difficulties surrounding the
use of change scores are avoided in the aptitude-ability model, since
gain is a predictor rather than a dependent variable. As with any
other predictor variable, gain may have both true and error com-
ponents. However, other paradoxes are unresolved. For instance,
although classical test theory has assumed that errors of measure- 
ment are the same at each score level, it is probably true that scores 
at the low end of the distribution are more unreliable than those 
at the higher end. This means, then, that gain will be directly corre-
lated with unreliability. A successful distinction of aptitude and
ability may necessitate some reformulation of measurement theory 
or practice with respect to gain. 

These considerations have important implications for research on 
the aptitude-ability model. The most basic implication is that
research which has been traditionally conducted only during test 
validation must now be conducted during test development. Indi-
vidual differences in change over practice must be studied by sensi-
tive techniques, such as molar correlation analysis (Jones, 1970a), so
that stages of practice can be determined. The influence of individual 
differences in extrinsic factors, such as population characteristics 
and other demographic factors, on change in different stages of 
practice must be studied simultaneously so that the intervention 
which provides the most useful modifiability index can be selected. 
Furthermore, it may be possible to deal with some of the paradoxes 
surrounding gain scores during test development, by designing tests
specifically for measuring change.

Cronbach and Furby (1970) have considered the selection of 
individuals on the basis of residualized gain such as suggested here, 
to be unclear as to purpose. That is, it is difficult to determine if 
the unexpected gain was accidental, due to underestimation by the 
pretest or overestimation by the post-test. Thus, it is unclear as 
to how these individuals should be differentially treated. How- 
ever, the problem noted by Cronbach and Furby is actually an 
empirical question: will the use of residualized gain scores lead to 
increments in predictability? If the answer to this question is 
affirmative, then it can be assumed that high unexpected gains are 
due to underestimation by the pretest. If some of the difficulties 
involved in measuring the modifiability of complex traits can be 
remedied, a successful psychometric distinction between aptitude 
and ability will have importance both theoretically and in applica- 
tion. Special educational resources and remedial training programs 
can be selectively applied to those who would profit the most.


A MEASURE OF THE AVERAGE INTERCORRELATION


EDWARD P. MEYER
University of Chicago


Bounds are obtained for a coefficient proposed by Kaiser as 
a measure of average correlation and the coefficient is given an
interpretation in the context of reliability theory. It is suggested
that the root-mean-square intercorrelation may be a more appro-
priate measure of degree of relationship among a set of variables.


Given a set of statistical variables, the nature of the linear rela- 
tionships among the variables can be summarized by the matrix 
of intercorrelations of the variables. A researcher may, on occasion, 
wish to obtain some measure of the "average" intercorrelation,
either as an estimate of some common population value or as an 
indication of the degree of relationship among the variables as a 
group. Kaiser (1968) has presented rationale for using as a measure
of average correlation a coefficient, gamma ($\gamma$), which is a function
of the largest eigenvalue of the correlation matrix and the number
of variables. The purpose of this paper is twofold: first, to show 
that gamma is bounded numerically by two traditional measures 
of average correlation with equality obtaining under certain condi- 
tions and, secondly, to relate gamma to an estimate of average 
correlation obtained by applying the Spearman-Brown formula to a 
generalization of Cronbach’s coefficient alpha. It is hoped that these 
developments will shed some light upon the psychometric properties 
of gamma and suggest possible cautions with regard to interpreta- 
tion of the coefficient. 


Kaiser's Coefficient Gamma


Let $P = [\rho_{ij}]$ denote a correlation matrix of order $n$ and let
$\lambda_1 \le \lambda_2 \le \cdots \le \lambda_n$ denote the ordered eigenvalues of $P$. Then the
average correlation measure proposed by Kaiser (1968) is the
quantity

    \gamma = (\lambda_n - 1)/(n - 1).    (1)

Since $P$ is a correlation matrix, $\lambda_n$ must lie in the interval $[1, n]$
and hence, by definition (1), gamma must lie in the interval $[0, 1]$.
It will first be shown that it is possible to obtain bounds on gamma
which are tighter than the bounds 0 and 1.
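
As an illustration of definition (1), gamma can be computed from the largest eigenvalue of a correlation matrix. The sketch below uses plain power iteration so that it needs no external library; this is a choice made here for brevity, not a method proposed in the source, and the example matrices are hypothetical.

```python
import math


def largest_eigenvalue(p, max_iter=10_000, tol=1e-14):
    """Largest eigenvalue of a correlation matrix by power iteration.

    Assumes the arbitrary start vector is not orthogonal to the leading
    characteristic vector and is not annihilated by the matrix.
    """
    n = len(p)
    v = [1.0 + 0.1 * i for i in range(n)]
    lam = 0.0
    for _ in range(max_iter):
        w = [sum(p[i][j] * v[j] for j in range(n)) for i in range(n)]
        # Rayleigh quotient v'Pv / v'v for the current vector.
        new_lam = sum(a * b for a, b in zip(v, w)) / sum(a * a for a in v)
        norm = math.sqrt(sum(a * a for a in w))
        v = [a / norm for a in w]
        if abs(new_lam - lam) < tol:
            return new_lam
        lam = new_lam
    return lam


def kaiser_gamma(p):
    """Kaiser's gamma: (largest eigenvalue - 1) / (n - 1)."""
    n = len(p)
    return (largest_eigenvalue(p) - 1.0) / (n - 1)


# Hypothetical matrices: all intercorrelations .5, and no correlation at all.
equicorrelated = [[1.0, 0.5, 0.5], [0.5, 1.0, 0.5], [0.5, 0.5, 1.0]]
uncorrelated = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

print(kaiser_gamma(equicorrelated), kaiser_gamma(uncorrelated))
```

For the equicorrelated matrix the largest eigenvalue is 1 + (n - 1)(.5) = 2, so gamma equals the common correlation; for uncorrelated variables gamma is zero.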


Lower Bound for Gamma 

Notationally, given the matrix $P$ of intercorrelations of a set of
$n$ variables $z_i$ and the $n$-component constant weight vector $x =
[x_1, x_2, \ldots, x_n]'$, define the function

    \lambda(x) = x'Px / x'x,    x \ne 0.    (2)

It is well-known (e.g., Bellman, 1960) that

    \lambda_1 \le \lambda(x) \le \lambda_n,    x \ne 0,    (3)

with equality obtaining on the left or on the right in (3) if and
only if $x$ is a characteristic vector of $P$ associated with $\lambda_1$ or $\lambda_n$
respectively.

Setting $x = 1 = [1, 1, \ldots, 1]'$ in (2) yields

    \lambda(1) = 1 + (n - 1)\bar{\rho},    (4)

where

    \bar{\rho} = \frac{1}{n(n - 1)} \sum\sum_{i \ne j} \rho_{ij}    (5)

is the arithmetic mean of the correlations $\rho_{ij}$. It then follows from
(3) and (4) that

    \bar{\rho} \le \gamma    (6)

with equality if and only if $1$ is a characteristic vector of $P$ associated
with $\lambda_n$.

More generally, let $\mathcal{X}$ denote the set of $2^n$ distinct $n \times 1$ vectors
which can be constructed when the range of the constants $x_i$ is
restricted to the two values $+1$ and $-1$ and define

    \bar{\rho}_{max} = [\max \lambda(x) - 1]/(n - 1),    x \in \mathcal{X}.    (7)

It then follows from (2) and (3) that

    \bar{\rho}_{max} \le \gamma    (8)

with equality if and only if the vector $x$ in $\mathcal{X}$ yielding $\bar{\rho}_{max}$ is also
a characteristic vector of $P$ associated with $\lambda_n$. Inequality (8) simply
states that $\gamma$ must be greater than or equal to the maximum value
of $\bar{\rho}$ for all possible reflections of the original variables.
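
The two lower bounds can be illustrated by brute force for a small matrix. The sketch below enumerates all sign vectors, which is practical only for a modest number of variables. The example matrix is hypothetical and block-diagonal, so its eigenvalues (1.6, 1.0 and 0.4) are known in closed form, giving gamma = .3.

```python
from itertools import product


def mean_intercorrelation(p):
    """Arithmetic mean of the off-diagonal correlations."""
    n = len(p)
    total = sum(p[i][j] for i in range(n) for j in range(n) if i != j)
    return total / (n * (n - 1))


def max_reflected_mean(p):
    """Largest mean intercorrelation over all 2^n reflections of the variables."""
    n = len(p)
    best = float("-inf")
    for signs in product((1, -1), repeat=n):
        quad = sum(signs[i] * signs[j] * p[i][j]
                   for i in range(n) for j in range(n))
        lam = quad / n                      # lambda(x), since x'x = n
        best = max(best, (lam - 1.0) / (n - 1))
    return best


# Hypothetical matrix: variables 1 and 2 correlate -.6, variable 3 is unrelated.
p = [[1.0, -0.6, 0.0],
     [-0.6, 1.0, 0.0],
     [0.0, 0.0, 1.0]]
gamma = 0.3  # (1.6 - 1) / (3 - 1), from the known largest eigenvalue

rho_bar = mean_intercorrelation(p)
rho_max = max_reflected_mean(p)
print(round(rho_bar, 4), round(rho_max, 4), gamma)
```

Reflecting either of the first two variables turns the negative mean into a positive one, and both values remain below gamma.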


Upper Bound for Gamma

It will be shown that an upper bound for $\gamma$ can be obtained by
application of the following lemma to the eigenvalues of $P$.

Lemma

Let $\mathcal{A} = [a_1, a_2, \ldots, a_n]$ be a set of $n$ real numbers and define

    \mu_a = \frac{1}{n} \sum_{i=1}^{n} a_i,
                                                                (9)
    \sigma_a^2 = \frac{1}{n} \sum_{i=1}^{n} (a_i - \mu_a)^2.

Then $a_{max}$, the largest element in $\mathcal{A}$, must satisfy

    a_{max} \le \mu_a + \sqrt{n - 1}\, \sigma_a,    (10)

with equality if and only if the set $\mathcal{A}$ contains one element $a_{max}$ and
$n - 1$ equal remaining elements $a_i \le a_{max}$.

Proof:2 Without loss of generality we can assume that $a_n = a_{max}$.
Let $b_i = a_i - \mu_a$, $i = 1, 2, \ldots, n$. Then $b_n = b_{max}$, $\sum_{i=1}^{n} b_i = 0$,
and $\frac{1}{n} \sum_{i=1}^{n} b_i^2 = \sigma_a^2$. Thus,

    (n - 1) \sum_{i=1}^{n-1} b_i^2 \ge \left( \sum_{i=1}^{n-1} b_i \right)^2 = (-b_n)^2 = b_n^2,    (i)

with equality if and only if $b_1 = b_2 = \cdots = b_{n-1}$. Adding $(n - 1) b_n^2$
to both sides of (i), we obtain

    n b_n^2 \le (n - 1) \sum_{i=1}^{n} b_i^2 = n(n - 1) \sigma_a^2.    (ii)

Since $a_n \ge \mu_a$, (ii) becomes

    a_n - \mu_a = |b_n| = \sqrt{b_n^2} \le \sqrt{n - 1}\, \sigma_a.    (iii)

The remark following (i) shows that equality holds if and only if
$b_1 = b_2 = \cdots = b_{n-1}$; or, equivalently, if and only if $a_1 = a_2 =
\cdots = a_{n-1}$ (since $b_i = a_i - \mu_a$). QED.

2 The author is indebted to an unknown reviewer for suggesting this proof
of the lemma.

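In terms of the eigenvalues $\lambda_i$ of $P$, (9) becomes

    \mu_\lambda = \frac{1}{n} \sum_{i=1}^{n} \lambda_i = 1,
                                                                                    (11)
    \sigma_\lambda^2 = \frac{1}{n} \sum_{i=1}^{n} (\lambda_i - 1)^2 = (n - 1) \bar{\rho}_{rms}^2,

where

    \bar{\rho}_{rms} = \left[ \frac{1}{n(n - 1)} \sum\sum_{i \ne j} \rho_{ij}^2 \right]^{1/2}    (12)

denotes the root mean square intercorrelation of the n variables.
Making the substitution (11) in (10) yields

    \lambda_n \le 1 + (n - 1) \bar{\rho}_{rms},    (13)

or, equivalently,

    \gamma \le \bar{\rho}_{rms},    (14)

with equality if and only if $\lambda_1 = \lambda_2 = \cdots = \lambda_{n-1}$.

Since $\bar{\rho}_{max}$ is greater than or equal to zero and $\bar{\rho}_{rms}$ is less than
or equal to one, it follows from (8) and (14) that $\bar{\rho}_{max}$ and $\bar{\rho}_{rms}$
provide bounds for gamma which are tighter than the bounds 0 and 1,
respectively.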

Generalization of Coefficient Alpha and Average Intercorrelation

If, without loss of generality, [...] items of a test of length [...],
[...] interpretation to gamma within the [...]. First, define the functions

    \alpha(w) = \frac{n}{n - 1} \left[ 1 - \frac{w'D^2 w}{w'\Sigma w} \right],    w \ne 0,    (15)

and

    \bar{\rho}_{est}(w) = \frac{\alpha(w)}{n + (1 - n)\alpha(w)},    w \ne 0,    (16)

where $\Sigma$ denotes the item (variable) covariance matrix and $D^2 =
Diag(\Sigma)$ is a diagonal matrix such that

    P = D^{-1} \Sigma D^{-1}.    (17)

Equation (15) is a generalization of the Kuder-Richardson re-
liability coefficient, Cronbach's coefficient alpha, to the case where
items are weighted unequally, with weights $w = [w_1, w_2, \ldots, w_n]'$,
and equation (16) is the corresponding generalization of the esti- 
mate of average intercorrelation obtained by applying the Spear- 
man-Brown formula to (15) with 1/n as the multiple of test length. 
The generalized coefficient alpha (equation [15]) has been con- 
sidered by Mosier (1943) and by Lord (1958); the rationale under- 
lying use of the Spearman-Brown formula to obtain an estimate of 
average interitem correlation which is independent of test length 
(equation [16]) can be found in Cronbach (1951). 

Note that if the items (variables) are given equal weight, i.e.,
$w = 1 = [1, 1, \ldots, 1]'$, then $\alpha(1)$ in (15) reduces to Cronbach's
coefficient alpha (Cronbach, 1951, equation [24]) and $\rho_\alpha(1)$ in (16)
corresponds to the estimate of $\rho$ that one would obtain by applying
the Spearman-Brown formula to coefficient alpha with 1/n as the
multiple of test length (Cronbach, 1951, equations [44] and [45]).
When items (variables) are not weighted equally, formulas (15) and 
(16) provide more general, but conceptually equivalent, measures 
of reliability and estimated average intercorrelation respectively. 

Making the change of variable $y = Dw$ in (16),

$$\rho_\alpha(y) = \frac{(y'Py / y'y) - 1}{n - 1}, \qquad y \ne 0, \tag{18}$$

and it is evident that

$$\max_{w \ne 0}\rho_\alpha(w) = \max_{y \ne 0}\rho_\alpha(y) = \gamma, \tag{19}$$

i.e., the maximum possible value of $\rho_\alpha(w)$ is Kaiser's coefficient
gamma.

Since, by the reciprocal relationship (16), the weight vector $w$
which maximizes $\rho_\alpha(w)$ is also the vector which maximizes $\alpha(w)$,
it follows that gamma can be interpreted as the estimate of $\rho$ obtained
by applying the Spearman-Brown formula to the maximum value
of $\alpha(w)$, the generalized coefficient alpha, with 1/n as the multiple
of test length. In other words, if the item (variable) weight vector $w$
is chosen so as to maximize $\alpha(w)$, then the corresponding $\rho_\alpha(w)$
is also maximized and, as indicated in (19), the value of this maximum
is gamma.
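
The relationship between the weighted alpha, its Spearman-Brown step-down, and gamma can be checked numerically. The following is a minimal sketch, assuming the forms of (15), (16), and (17) as they are read here, with gamma taken as $(\lambda_1 - 1)/(n - 1)$. The covariance matrix `SIGMA`, the function names, and the use of power iteration to find the largest eigenvalue are illustrative assumptions, not part of the paper.

```python
import math


def quad(m, v):
    """Quadratic form v' M v for a square matrix M given as a list of rows."""
    size = len(v)
    return sum(v[i] * m[i][j] * v[j] for i in range(size) for j in range(size))


def alpha_w(sigma, w):
    """Weighted coefficient alpha, as in equation (15)."""
    n = len(w)
    weighted_item_var = sum(w[i] ** 2 * sigma[i][i] for i in range(n))  # w' D^2 w
    return n / (n - 1) * (1 - weighted_item_var / quad(sigma, w))


def rho_alpha(sigma, w):
    """Spearman-Brown step-down of alpha(w) to unit length, as in equation (16)."""
    n = len(w)
    a = alpha_w(sigma, w)
    return a / (n + (1 - n) * a)


def leading_eigen(p, iters=2000):
    """Largest eigenvalue and unit eigenvector of a symmetric PSD matrix (power iteration)."""
    n = len(p)
    y = [1.0] * n
    for _ in range(iters):
        z = [sum(p[i][j] * y[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in z))
        y = [x / norm for x in z]
    return quad(p, y), y


# Hypothetical 3-item covariance matrix (illustrative values only).
SIGMA = [[4.0, 1.2, 0.8],
         [1.2, 1.0, 0.3],
         [0.8, 0.3, 2.25]]
n_items = len(SIGMA)
d = [math.sqrt(SIGMA[i][i]) for i in range(n_items)]
# Correlation matrix P = D^-1 Sigma D^-1, as in equation (17).
P = [[SIGMA[i][j] / (d[i] * d[j]) for j in range(n_items)] for i in range(n_items)]

lam1, y_best = leading_eigen(P)
gamma = (lam1 - 1) / (n_items - 1)
w_best = [y_best[i] / d[i] for i in range(n_items)]  # undo the substitution y = D w

print("rho_alpha at equal weights:", rho_alpha(SIGMA, [1.0] * n_items))
print("rho_alpha at maximizing w :", rho_alpha(SIGMA, w_best))
print("gamma                     :", gamma)
```

Under these assumptions the value of `rho_alpha` at `w_best` coincides with `gamma`, while equal weights give a value no larger than gamma.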


Interpretation of Gamma 


It should be evident from the original derivation of gamma (Kaiser,
1968) and the subsequent first centroid approximation (Cureton,
1971) as well as the result in this paper relating gamma to coefficient
alpha, that, if one interprets correlation as degree of relationship,
then gamma is an appropriate measure of average correlation
among a set of variables only to the extent that one has a homogeneous
(single factor) set of variables. To the extent that more than one
factor is necessary to account for the correlations among the variables,
gamma will tend to underestimate the true degree of relationship
among the variables. In such cases, $\rho_{rms}$ may provide a more
appropriate measure of the true degree of relationship among the
variables. Since this paper has established that $\gamma \le \rho_{rms}$, with equality
obtaining only in the "single factor" case with $\lambda_2 = \cdots = \lambda_n < \lambda_1$,
$\rho_{rms}$ would appear to be a more general measure of degree of
relationship among a set of variables.

For example, taking the classic correlation matrix from Hotelling
(1933):


    1.000    .698    .264    .081
     .698   1.000   -.061    .092
     .264   -.061   1.000    .594
     .081    .092    .594   1.000


one finds $\bar{\rho} = .278$, $\gamma = .282$, $\rho_{rms} = .393$. Since, for the Hotelling
matrix, $\lambda_1 = 1.846$, $\lambda_2 = 1.465$, $\lambda_3 = .521$, and $\lambda_4 = .167$, the reason
for the discrepancy between $\gamma$ and $\rho_{rms}$ should be obvious.
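
The three indices for the Hotelling matrix can be recomputed directly from the correlations. The sketch below assumes that the mean intercorrelation is the average of the off-diagonal entries, that the root mean square intercorrelation follows (12), and that gamma equals $(\lambda_1 - 1)/(n - 1)$. The variable names and the power-iteration eigenvalue routine are illustrative choices; the results should be compared with the values quoted above.

```python
import math

# Hotelling (1933) correlation matrix as given in the text.
R = [[1.000, 0.698, 0.264, 0.081],
     [0.698, 1.000, -0.061, 0.092],
     [0.264, -0.061, 1.000, 0.594],
     [0.081, 0.092, 0.594, 1.000]]
n = len(R)

# All off-diagonal correlations (each pair appears twice, which leaves the means unchanged).
off_diag = [R[i][j] for i in range(n) for j in range(n) if i != j]

rho_bar = sum(off_diag) / len(off_diag)
rho_rms = math.sqrt(sum(r * r for r in off_diag) / len(off_diag))

# Largest eigenvalue by power iteration (R is positive definite here).
v = [1.0] * n
for _ in range(500):
    z = [sum(R[i][j] * v[j] for j in range(n)) for i in range(n)]
    norm = math.sqrt(sum(x * x for x in z))
    v = [x / norm for x in z]
lam1 = sum(v[i] * R[i][j] * v[j] for i in range(n) for j in range(n))

gamma = (lam1 - 1) / (n - 1)

print(f"mean r = {rho_bar:.3f}, gamma = {gamma:.3f}, rms r = {rho_rms:.3f}")
```

The ordering mean r, gamma, rms r reflects the bounds discussed in the text; the gap between gamma and the rms value grows as the second eigenvalue approaches the first.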
EFFECTS OF A CONFIDENCE WEIGHTED SCORING
SYSTEM ON MEASURES OF TEST RELIABILITY
AND VALIDITY


RICHARD C. PUGH AND J. JAY BRUNZA
Indiana University


The validity of a confidence scored vocabulary test was investigated
by demonstrating an increase in its reliability without
changing the relative difficulty of the test items and without
detecting any personality bias in the confidence scoring system.
The reliability estimate of the vocabulary test increased from .57
using a traditional scoring system to .85 using a confidence scoring
system. No significant interaction was found between the difficulty 
of the test items and the type of scoring system. Three personality 
measures failed to correlate significantly with the confidence scores 
of the vocabulary test. 


Objective multiple-choice tests have been designed to allow
respondents to indicate their degree of confidence in each option
of a given test item. Ahlgren (1969) has shown that confidence
scoring systems are effective in improving the reliability of scores
on objective multiple-choice tests. As the reliability increases, a
change in the relative difficulty of test items may occur. If a change
in relative difficulty should occur, additional shifts in the character- 
istics of the test would be suggested. Few studies can be found that 
report a comparison of relative item difficulties. 

The purposes of this study were (a) to demonstrate an increase
in the reliability of a multiple-choice vocabulary test using a
confidence scoring system, (b) to determine if a change in the relative
item difficulties occurs under a confidence scoring system, and (c)
to assess the relationship of traditional and confidence scored forms
of a vocabulary test with selected personality measures. These three
purposes were related to the validity of confidence scored tests.

Copyright © 1975 by Frederic Kuder

Typically, research studies concerned with the effects of confidence
scoring systems utilize multiple-choice tests that measure
subject matter achievement. In this study, a vocabulary test was
selected to see whether reliability could be increased on a test that
measured something other than classroom achievement. The vocabulary
test also provided enough items to permit typical estimates of
reliability within a reasonable length of time.

Measures of external control, risk taking, and cautiousness were 
obtained to assess personality factors that might relate to the 
vocabulary scores under the testing condition of confidence scoring. 
The internal-external scale delineated individuals according to
differences in a generalized expectancy or belief in external control
(Rotter, 1966). The Kogan and Wallach (1964) questionnaire 
assessed an individual's propensity for risk-taking behavior and the 
Gordon (1956) Personal Inventory measured the general trait of 
cautiousness. 


Method 
Subjects 


The Ss used in this study were graduate students enrolled in 
educational measurement courses in the Department of Educational 
Psychology at Indiana University. These were required courses for 
several degree programs and had students enrolled from various 
subject-matter fields. Four sections of the measurement courses 
were chosen to generate a sample of 84 Ss. The sample consisted of
55 females and 29 males. The mean age of the sample was 25.5
years.


Procedure 


The Ss were administered a 48-item multiple-choice vocabulary 
test consisting of items from the I.E.R. Intelligence Scale (1946).
Items were selected from each of five levels of the intelligence scale 
and were randomly assigned to one of two sections of the test. The 
two sections of the test were considered to contain alternate sets 
of items. 

Selection of a scoring scheme was based primarily on two reports.
De Finetti (1965) investigated various answering techniques and
scoring methods in order to make an adequate appraisal of subjective
probabilities related to confidence testing. Rippey (1970)
reported a comparative study of different scoring functions for
confidence tests. In both studies evidence was presented that simple
scoring systems yielded relatively favorable characteristics when
compared to complex scoring systems. The simple scoring systems
had the overall advantage of facilitating the communication of the
scoring process to examinees.

The Ss expressed their degree of confidence in each option by 
assigning a number from 0 through 5 to each of the five options on 
the vocabulary items. The score on the item was the number S 
chose for the correct answer. 
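
The two scoring rules can be stated compactly in code. This is a minimal sketch, assuming each response is a list of five integers from 0 through 5 (one per option) and that the keyed option is identified by its position; the function names and the input checks are illustrative and not taken from the study.

```python
def confidence_score(ratings, correct_index):
    """Item score under confidence scoring: the number assigned to the keyed option.

    ratings       -- five integers, each from 0 through 5, one per option
    correct_index -- position (0-4) of the correct option
    """
    if len(ratings) != 5 or any(not 0 <= r <= 5 for r in ratings):
        raise ValueError("expected five ratings, each from 0 through 5")
    return ratings[correct_index]


def traditional_score(chosen_index, correct_index, weight=5):
    """Item score under right-wrong scoring, with a correct answer weighted 5."""
    return weight if chosen_index == correct_index else 0


# A respondent fairly sure of option 3 (index 2) and hedging on option 5 (index 4).
example_ratings = [0, 1, 4, 0, 2]
print(confidence_score(example_ratings, correct_index=2))   # keyed option received a 4
print(traditional_score(chosen_index=2, correct_index=2))   # right-wrong credit
```

Weighting a correct traditional answer by 5 puts both scoring systems on the same 0 to 5 range per item, which is the convenience the analysis relies on.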

To demonstrate comparative reliabilities of the vocabulary test
for differing scoring systems, three forms of the vocabulary test
were created. The three forms (A, B, and C) consisted of the same
items divided into two sections of 24 items each. The three forms
differed only in the directions given to the examinees. The directions
for Form A followed the traditional right-wrong format for both
sections. The examinees were told that their scores would be the
number of items answered correctly. The correct answers were given
a weight of five for convenience in the analysis. The directions for
Form B followed the confidence system for Section 1 and the
traditional system for Section 2. The directions for Form C followed the
confidence system for both sections. The three different forms were
randomly assigned to the 84 Ss.

Prior to taking the vocabulary test all Ss were given a brief train- 
ing session consisting of two parts. The first part was a presentation 
by Es of the confidence scoring system. The second part allowed Ss 
to use the confidence scoring system on practice test items. 

The effect of the different directions for the three test forms was 
studied by comparing the difficulty of the test forms, sections, and 
individual items under the three sets of conditions. A three-factor 
analysis of variance was computed. Differences among the levels of 
the forms, the sections, and the items nested in sections and their 
interactions were determined. 

The reliability of the vocabulary test forms was estimated for 
the two sections and the total test using analysis of variance (Hoyt, 
1941). Relationships of the personality measures with the vocabu- 
lary test were assessed using product-moment correlation coefficients. 
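
Hoyt's analysis-of-variance estimate of reliability can be sketched as follows, assuming the usual persons-by-items layout in which reliability is one minus the ratio of the residual mean square to the between-persons mean square. The function name and the small score matrix are hypothetical and serve only to illustrate the computation.

```python
def hoyt_reliability(scores):
    """Hoyt (1941) reliability from a persons-by-items score matrix (list of rows)."""
    n_persons = len(scores)
    n_items = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_persons * n_items)

    person_means = [sum(row) / n_items for row in scores]
    item_means = [sum(row[j] for row in scores) / n_persons for j in range(n_items)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = n_items * sum((m - grand) ** 2 for m in person_means)
    ss_items = n_persons * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items  # persons x items interaction

    ms_persons = ss_persons / (n_persons - 1)
    ms_residual = ss_residual / ((n_persons - 1) * (n_items - 1))
    return 1 - ms_residual / ms_persons


# Hypothetical item scores on the 0-5 confidence scale: 5 persons by 4 items.
demo_scores = [[5, 4, 5, 3],
               [3, 2, 4, 1],
               [0, 1, 2, 0],
               [4, 4, 3, 2],
               [1, 0, 2, 1]]
print(round(hoyt_reliability(demo_scores), 3))
```

This estimate is algebraically the same as coefficient alpha computed from item and total-score variances, so either route can be used as a check on the other.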


Results 


The relative difficulty of the three forms, the two sections, and the
forty-eight items was assessed using analysis of variance. All three
factors were considered fixed. Since all Ss responded to all items in
both sections, repeated measures were assumed across sections and
items nested in sections. An analysis of variance is presented in
Table 1.

No significant difference was found among the means of the three
forms. Although the two sections were intended to be alternative
sets of the vocabulary items, a significant difference (p < .05)
between sections was found. The section means indicated that the
second section was more difficult than the first. No significant
interaction was found between sections and forms. A significant
difference (p < .01) among the difficulty of items was found. This
was expected since the items were selected from five levels of the
I.E.R. Intelligence Scale. No significant interaction was found
between the items and forms, indicating that the relative difficulty of
the items did not differ significantly among the three forms.

Table 2 consists of selected characteristics of the sets of vocabu- 
lary test items. 

Estimates of the reliability for Sections 1 and 2 for Form A were
.48 and .42, respectively. An overall reliability estimate of .57 was
found for Form A. Reliability estimates for Form B were .62 for
Section 1 and .48 for Section 2. For Form C the reliability estimates
were .70 for Section 1 and .78 for Section 2. An overall reliability
estimate for Form C was found to be .85.

Product-moment correlation coefficients were computed between
the three personality measures and each of the two sections along
with the total score of the vocabulary test forms. None of the
coefficients was statistically significant at the .05 level. The coefficients
ranged from .17 to -.31, but a coefficient of .37 was needed for a
relationship to be statistically significant (df = 26).


TABLE 1


Results of an Analysis of Variance Produced by the Vocabulary Test Items Using
Three Test Forms


Source                               df       MS        F
Between Ss
  Forms (F)                           2      ...      ...
  Ss within forms R(F)               81      ...      ...
Within Ss
  Sections (S)                        1    17.02     4.07*
  F X S                               2     2.98      <1
  S X R(F)                           81     4.18
  Items within sections I(S)         46    84.81    23.62**
  Forms X I(S)                       92     4.31     1.20
  Error                           3,726     3.59

*p < .05.
**p < .01.


TABLE 2


Means, Standard Deviations and Reliability Estimates for Two Sets of Vocabulary
Items Using Three Test Forms, and Coefficients of Correlation between
the Sets of Vocabulary Items and Selected Personality Measures


                                           Ext. Control    Risk    Caut.
Form and Section     Mean     SD    r_xx       r_xy        r_xy    r_xy
Form A
  1                  2.45    .61    .48        -.15        -.09     .15
  2                  2.21    .60    .42        -.18        -.18     ...
  Overall            2.33    .49    .57        -.20         .05     .15
Form B
  1                  2.05    .49    .62        -.31         .09     .01
  2                  1.99    .65    .48        -.20        -.07    -.05
Form C
  1                  2.16    .54    .70         .09        -.05     .02
  2                  2.08    .66    .78        -.08        -.09     .00
  Overall            2.12    .56    .85         .02        -.08     .01

Conclusions 


By using a confidence scoring system, the reliability of the
vocabulary test was increased without apparently altering the relative
difficulty level of the items. The estimates of reliability for the
sections of the vocabulary tests under the confidence testing system
were substantially higher (.62-.78) than reliability estimates of the
same sections using the traditional scoring system (.42-.48). Pooling
together the two sections of the vocabulary test, the reliability
estimate for Form C using the confidence testing system was .85.
This was substantially higher than the .57 reliability estimate for
Form A using the traditional scoring system. The increase in length
of time to answer the 48 items was from an average of 14 minutes
for Form A to an average of 19.5 minutes for Form C, a factor of
only 1.4 times. The increase in reliability could not be accounted
for solely by an increase in length of time since the effective test
length was more than 3 times as great.
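
One way to read the comparison of time and effective length is through the Spearman-Brown formula. The sketch below assumes that "effective test length" means the lengthening factor that would raise a reliability of .57 to .85 under that formula; the function names are illustrative, and the paper does not state how its figure was obtained.

```python
def spearman_brown(reliability, k):
    """Projected reliability when a test is lengthened by a factor k."""
    return k * reliability / (1 + (k - 1) * reliability)


def lengthening_factor(r_old, r_new):
    """Factor k by which a test of reliability r_old must be lengthened to reach r_new."""
    return r_new * (1 - r_old) / (r_old * (1 - r_new))


r_traditional = 0.57   # Form A, overall
r_confidence = 0.85    # Form C, overall
time_factor = 19.5 / 14.0

k_equivalent = lengthening_factor(r_traditional, r_confidence)
r_if_only_longer = spearman_brown(r_traditional, time_factor)

print(f"equivalent lengthening factor: {k_equivalent:.2f}")
print(f"reliability expected from a test {time_factor:.1f} times as long: {r_if_only_longer:.2f}")
```

On this reading, a traditionally scored test lengthened in proportion to the extra testing time would be expected to reach a reliability of only about .65, well short of the .85 observed for Form C.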

No personality measures correlated significantly with any form
of the vocabulary test. The vocabulary test was not found to have
a personality bias since the personality measures were reliable
enough to allow significant relationships [...]
estimates for the external control me[...]
risk taking .67-.77, and for cautiousness [...]
variance. The validity of the vocabulary [...]
improved based on the increase in reliability using a confidence
scoring system, since no significant change in the relative difficulty
of test items was found and no significant personality bias was
detected.


A TEST FOR HOMOGENEITY OF REGRESSION
WITHOUT HOMOGENEITY OF VARIANCE


DONALD H. McLAUGHLIN¹
University of California, Berkeley


A likelihood ratio test procedure is described for testing for
homogeneity of regression coefficients and population means across
treatment groups without assuming homogeneous error variances.
Maximum likelihood estimation equations are obtained assuming
the criterion scores to be normally distributed, and a method for
their solution is described. Examples of situations for which the
procedure is appropriate are given, and extensions of the procedure
are discussed. The procedure has been used on hypothetical
and real data.


Tests for treatment effects on the relation between a set of
predictor variables and a criterion variable, such as tests for
aptitude-treatment interaction (ATI) in educational settings, as well
as treatment effects on population means, may be analyzed
statistically, even though the residual variances differ significantly across
treatment groups. The assumption of homogeneity of variance is
employed in current methods of analysis primarily to reduce the
complexity of the computation. The purpose of this paper is to
develop a testing procedure based on a model including treatment
effects on residual variances. The main difficulty in the procedure
is the solution of simultaneous equations for maximum likelihood
estimates of parameters of the model; and an application of
Newton's method is described which has been tested and found to be
quite efficient.

There are two methods for testing for ATI which the proposed
procedure would replace. The most common method, which is
normally used with a single predictor variable, is to divide each
treatment group into two subgroups, on the basis of whether their
predictor scores are above or below the median. The predictor is
then treated as merely another factor in an ANOVA model. There
are at least two disadvantages of this procedure. First, by
disregarding predictor variance within the subgroups created, a great
deal of information in the data is neglected, reducing the power of
the test. Second, and more important for the application of the
procedure to the case of multiple predictors, the number of
subgroups necessary is an exponential function of the number of
predictor variables, and unless the predictors are mutually independent
in each treatment, there may be a great deal of difficulty obtaining
observations in each subgroup. Also, the use of the simple ANOVA
model assumes homogeneity of variances across treatments.

¹ Now at the American Institute for Research, Palo Alto, California.

A more reasonable procedure has been developed (Gulliksen and
Wilks, 1950) which is based on the estimation of regressions of the
criterion on the predictors for each treatment. The Gulliksen-Wilks
procedure is based on the assumption that the criterion score, $Y_{jk}$,
for the jth subject in treatment group k, is normally distributed
with a mean which is the sum of the population mean for that
treatment and a linear combination of the subject's predictor scores
and with a variance, $\sigma_k^2$. The procedure consists of three successive
tests, and the problem which the procedure described below solves is
that the tests are ordered and if one is "failed," successive tests
are invalid. The first test is whether the variances are homogeneous
across treatments. If homogeneity of variance is rejected, then the
other two tests are not valid. The second and third tests are for
homogeneity of regressions and of population means across
treatments, and it is this pair of tests that is likely to be important to
the researcher, even though the variances are not homogeneous.
Therefore, using the Gulliksen-Wilks procedure, one risks the
possibility that the outcome may be that [...] for
his hypotheses.

The procedure described [...]
data as the Gulliksen-Wilks [...]
comparisons of the likelihood of the data under
the assumption that all three factors, means, variances, and
regressions, may vary across treatment groups ($H_{VVV}$) with three
alternative hypotheses: that the regressions are homogeneous across
treatments although the means and variances may vary ($H_{VVE}$), that
the variances are homogeneous although the other parameters may
vary ($H_{VEV}$), and that the population means are homogeneous
although the other parameters may vary ($H_{EVV}$).

Before considering the details of the procedure, you may wish to 
evaluate the usefulness of this extension of the Gulliksen-Wilks pro- 
cedure. There are at least two types of situations in which the 
assumption of homogeneity of variance is a burden. One is an ATI
study in which unmeasured aptitudes intrude on the interaction
with treatments. For example, a researcher might be interested in
the interaction of two cognitive ability scores, such as verbal
reasoning and spatial imagery, with two instructional methods, in
teaching, say, experimental design. However, although attempts are made
to equate the instructional methods on requirements of aptitudes 
other than the two of interest, the researcher may not wish to guar- 
antee that he has measured all sources of aptitude effects. For ex- 
ample, if one of the instructional methods induces a greater anxiety 
during the study, if that anxiety is correlated with criterion scores, 
and if subjects differ in sensitivity to the anxiety-producing situa- 
tions, then, assuming that anxiety was not measured, the results 
would show an apparent heterogeneity of error variance. Those 
treatments which induced greater anxiety would have an extra 
source of error variance. Therefore, the Gulliksen-Wilks test for 
ATI would not be valid. 

The second type of situation where the advantages of the
procedural independence of the three tests are apparent is in the
analysis of covariance in which the treatment groups are actually
naturally occurring groups. One may be interested in testing for
differences in population means without regard to differences in
other population parameters. For example, one may be interested
in sex differences in mathematical problem-solving, corrected for
amount of mathematical experience, even though the contribution
of that experience to the criterion differs between sexes. If, in fact,
there were sex differences in the predictor scores, and if the actual
relation between experience and the criterion were not linear, then
differences in the regressions between sexes would be expected. The
ordinary procedure for analysis of covariance would, therefore, not
be valid.

We turn now to a description of the alternative procedure. The
data to be analyzed are observations of a single criterion score, $Y_{jk}$,
and a vector, $X_{jk}$, of predictor scores for each of $n_k$ subjects in the
kth of m treatment groups. Each of the criterion scores is assumed to
be an independent normally distributed random variable whose
mean is the sum of the population mean for that treatment, $\mu_k$, and
a linear combination, $\beta_k'X_{jk}$, of the predictor scores for that subject,
and whose variance is $\sigma_k^2$. Although not necessary, the scores are
usually assumed to be interval level measurements rather than ratio
scales, so the grand mean is subtracted from each score. The
likelihood of the data given this model, which we have denoted $H_{VVV}$, is:

$$L = \prod_{k=1}^{m}\prod_{j=1}^{n_k}(2\pi\sigma_k^2)^{-1/2}\exp\left[-\frac{(Y_{jk} - \mu_k - \beta_k'X_{jk})^2}{2\sigma_k^2}\right]. \tag{1}$$

The likelihoods of the data under the three alternative hypotheses
to be considered, $H_{VVE}$, $H_{VEV}$, $H_{EVV}$, are obtained by replacing $\beta_k$ by
$\beta$, $\sigma_k^2$ by $\sigma^2$, and $\mu_k$ by $\mu$, respectively.

Once the likelihood of the data under each hypothesis is
calculated, using maximum likelihood estimates of the free parameters,
the ratio, $\Lambda$, of the likelihood of the data given the more constrained
of two hypotheses to the likelihood given the less constrained model
is calculated. It is well-known (Wilks, 1962) that $-2\log_e(\Lambda)$ is
asymptotically distributed as a chi-square variable with the number
of degrees of freedom equal to the difference in the number of
parameters constrained by the two models, if the more restricted
model is true. Larger values of the statistic are to be interpreted
as rejections of the more constrained of the two models on the basis
of the data.

The problems in applying the procedure lie in the solutions of
the equations for the maximum likelihood estimates of the free
parameters. Simultaneous equations for the estimates are given
in Table 1, for each of the four hypotheses. They can be seen to
depend only on the sample moments of the criterion and the
predictors in the treatment groups. In the cases of $H_{VVV}$ and $H_{VEV}$,
the solutions of these equations are straightforward: the estimates
of the regression coefficients are obtained first, and those estimates
are used in the equations for the estimates of the population means
and error variances. These are the only two sets of equations which
must be solved to apply the Gulliksen-Wilks procedure. For $H_{VVE}$
and $H_{EVV}$, the hypotheses of equal regressions and of equal means,
without assuming equal variances, the equations cannot be solved
directly. However, they can be solved iteratively, by Newton's
method. The amount of computation in such a solution is significant,
but given the availability of a computer and intelligent choices of
variables for iteration and starting values, the procedure is quite
efficient, taking about three iterations to reach negligible error levels.

For $H_{VVE}$, a function F, of the vector of variances, is defined by:

$$F_k(\hat{\sigma}^2) = \hat{\sigma}_k^2 - \big(\mathrm{var}_k(Y) - 2\hat{\beta}'\mathrm{cov}_k(X, Y) + \hat{\beta}'\mathrm{cov}_k(X, X)\hat{\beta}\big), \qquad k = 1, \ldots, m, \tag{2}$$

where $\hat{\beta}$ is the function of $\hat{\sigma}^2$ given in the $H_{VVE}$ row of Table 1. The
maximum likelihood estimates of the $\hat{\sigma}_k^2$'s are the roots of F. Initial
values for the variances, $\hat{\sigma}_{(0)}^2$, are set equal to the values obtained
for $H_{VVV}$. Successive values of the variances are then obtained by
the equation:

$$\hat{\sigma}_{(i+1)}^2 = \hat{\sigma}_{(i)}^2 - J^{-1}F(\hat{\sigma}_{(i)}^2), \tag{3}$$

where the (k, l) element of the matrix J is the derivative of $F_k$ with
respect to $\hat{\sigma}_l^2$, and is given by:

$$J_{kl} = \delta_{kl} - \frac{2n_l}{\hat{\sigma}_l^4}\big(\mathrm{cov}_k(X, Y) - \mathrm{cov}_k(X, X)\hat{\beta}\big)'\Big[\sum_{i=1}^{m}\frac{n_i}{\hat{\sigma}_i^2}\mathrm{cov}_i(X, X)\Big]^{-1}\big(\mathrm{cov}_l(X, Y) - \mathrm{cov}_l(X, X)\hat{\beta}\big),$$

where $\delta_{kl}$ is one if k = l and zero otherwise.
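For $H_{EVV}$, a function $G(\hat{\mu})$ is defined by:

$$G(\hat{\mu}) = \sum_{k=1}^{m}\frac{n_k}{\hat{\sigma}_k^2}\big(\bar{Y}_k - \hat{\beta}_k'\bar{X}_k - \hat{\mu}\big), \tag{4}$$

where $\hat{\sigma}_k^2$ and $\hat{\beta}_k$ are functions of $\hat{\mu}$ as given in the $H_{EVV}$ row of Table
1. The maximum likelihood estimate of $\mu$ is then the root of G. The
initial value of $\hat{\mu}$ is obtained by solving $G(\hat{\mu}) = 0$ with values of the
variances and regressions estimated for $H_{VVV}$. Successive values
of $\hat{\mu}$ are obtained by an equation analogous to (3), where the
derivative of G with respect to $\hat{\mu}$ is:

$$\frac{dG}{d\hat{\mu}} = \sum_{k=1}^{m}\frac{n_k}{\hat{\sigma}_k^2}\Big[\frac{2}{\hat{\sigma}_k^2}\big(\bar{Y}_k - \hat{\beta}_k'\bar{X}_k - \hat{\mu}\big)^2 + \bar{X}_k'\big(\mathrm{cov}_k(X, X) + \bar{X}_k\bar{X}_k'\big)^{-1}\bar{X}_k - 1\Big].$$
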
Given maximum likelihood estimates of the free parameters, the
likelihood ratios can easily be shown to depend only on the
estimates of the population variances, $\hat{\sigma}_k^2$. The chi-square statistics
and their respective degrees of freedom are given in Table 2. The
three tests are, of course, not statistically independent, because
they are all based partially on the same statistics of the data.
However, they are procedurally independent in that the validity of
each is independent of the outcome of the other tests, and they are
orthogonal in the sense that the existence of treatment effects on
any of the model parameters does not alter the values of the
statistics for the other tests. For example, if an extra source of
error variance is introduced into one of the treatment groups, which
is not correlated with the predictor scores, then the test for
homogeneity of regression will not be altered. On the other hand, each
of the tests is based on all of the sample moments, so alteration
of the sample moments would affect the outcome of all three tests.


TABLE 2
Test Statistics


Test                     $-2\log_e(\Lambda)$                                                            df

$H_{VVE}$ vs. $H_{VVV}$    $\sum_{k=1}^{m} n_k \log_e\big(\hat{\sigma}^2_{k(VVE)} / \hat{\sigma}^2_{k(VVV)}\big)$    p(m - 1)

$H_{VEV}$ vs. $H_{VVV}$    $\sum_{k=1}^{m} n_k \log_e\big(\hat{\sigma}^2_{(VEV)} / \hat{\sigma}^2_{k(VVV)}\big)$     m - 1

$H_{EVV}$ vs. $H_{VVV}$    $\sum_{k=1}^{m} n_k \log_e\big(\hat{\sigma}^2_{k(EVV)} / \hat{\sigma}^2_{k(VVV)}\big)$    m - 1


Note. p is the number of predictor variables, m is the number of treatment groups, and the
estimates of variance are the solutions to the equations expressed in the corresponding rows of
Table 1.
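
The test for homogeneity of regression can be illustrated for the simplest case of a single predictor. The sketch below is not the paper's FORTRAN program: it assumes within-group maximum likelihood moments (divisor $n_k$) as input, and it swaps in a plain fixed-point iteration (re-estimate the common slope, then the variances, and repeat) in place of the Newton iteration of (3). The dictionary keys, function names, and example moments are hypothetical.

```python
import math


def fit_vvv(groups):
    """Residual variances with a separate slope in every group (H_VVV)."""
    variances = []
    for g in groups:
        beta_k = g["cov_xy"] / g["var_x"]
        variances.append(g["var_y"] - beta_k * g["cov_xy"])
    return variances


def fit_vve(groups, tol=1e-12, max_iter=200):
    """Common slope with separate variances (H_VVE), by fixed-point iteration."""
    sigma2 = fit_vvv(groups)  # starting values taken from H_VVV, as in the text
    beta = 0.0
    for _ in range(max_iter):
        # Precision-weighted pooled slope given the current variances.
        num = sum(g["n"] * g["cov_xy"] / s for g, s in zip(groups, sigma2))
        den = sum(g["n"] * g["var_x"] / s for g, s in zip(groups, sigma2))
        beta = num / den
        # Residual variance of each group about the common slope.
        new = [g["var_y"] - 2 * beta * g["cov_xy"] + beta ** 2 * g["var_x"]
               for g in groups]
        converged = max(abs(a - b) for a, b in zip(new, sigma2)) < tol
        sigma2 = new
        if converged:
            break
    return beta, sigma2


def lr_homogeneity_of_regression(groups):
    """Return (-2 log Lambda, df, common slope) for H_VVE against H_VVV with p = 1."""
    free = fit_vvv(groups)
    beta, constrained = fit_vve(groups)
    stat = sum(g["n"] * math.log(c / f) for g, c, f in zip(groups, constrained, free))
    return stat, len(groups) - 1, beta


# Hypothetical within-group moments for two treatments with unequal error variance.
same_slope = [{"n": 50, "var_y": 4.0, "cov_xy": 1.0, "var_x": 1.0},
              {"n": 50, "var_y": 2.0, "cov_xy": 1.0, "var_x": 1.0}]
diff_slope = [{"n": 50, "var_y": 4.0, "cov_xy": 1.0, "var_x": 1.0},
              {"n": 50, "var_y": 2.0, "cov_xy": 0.2, "var_x": 1.0}]

print(lr_homogeneity_of_regression(same_slope))
print(lr_homogeneity_of_regression(diff_slope))
```

The statistic would be referred to a chi-square distribution with the returned degrees of freedom. With equal slopes it is zero even though the error variances differ, which is the procedural independence described above.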

There is a point to be made concerning the interpretation of 
homogeneity of regression in the context of apparent heterogeneity 
of error variance. The test for homogeneity of regression is sensitive 
to variation across treatments of the relative contributions of the 
set of predictors, rather than to variation of the total amount of 
predictability. In fact, a situation in which the error variance is
homogeneous and the regression coefficients are all multiplied by
a constant which is characteristic of each treatment, is theoretically 
indistinguishable, in terms of the model used, from a situation in 
which the regression coefficients are homogeneous and the error 
variances differ. However, this ambiguity is not of great importance 
in ATI designs because the researcher is usually looking for just 
those regression differences for which the test is sensitive: for 
differences in the relative contributions of different aptitudes to 
performance in different treatment conditions. 

There is one extension which is simple to implement, however.
It is based on the fact that the estimation procedures for any set
of treatment groups are independent of the procedures for other
treatments, assuming that there are no parametric constraints
between different sets of treatments. For example, in a two factor
experiment, the parameters can be estimated at each level of one
factor independently of their values at other levels of that factor.
The likelihood of the data of several sets of treatment groups is
just the product of the likelihood in each set, so tests can be made
between the alternative hypotheses that the parameters of the model
are homogeneous within sets of treatments versus that they are not
so homogeneous. In the two factor experiment, this means that
the effects of each treatment factor can be examined without regard
to the other factor.

This extension has been included in the FORTRAN IV computer
program used to test the procedure. That program is available
upon request, although it consists of little more than an expression
of the equations in this paper.


CONVERGENT AND DIVERGENT MEASUREMENT OF
CREATIVITY IN CHILDREN


WILLIAM C. WARD
Educational Testing Service 


Fourth through sixth grade children were given two types of
creativity measures: divergent measures, in which the child named
all the ideas he could that met a simple requirement, and
convergent measures, adaptations of Mednick's Remote Associates Test,
in which he attempted to find one word that was associatively
related to each of three others. Divergent and convergent
measures shared little variance, and the latter were strongly correlated
with IQ and achievement. Moreover, convergent items requiring 
production of the correct association were strongly related to items 
requiring only recognition. It was argued that in children Remote 
Associates performance depends on individual differences other 
than the size of the associative repertoire. 


Mednick conceptualized the creative process in associative terms,
seeing it as involving "the formation of associative elements into
new combinations which either meet specified requirements or are
in some way useful" (Mednick, 1962, p. 221). Individual differences
in creativity were seen as depending on differences in the number
and relative strength of associates the individual has available
that are relevant to a problem. This formulation is schematic: what
constitutes an element is not explicated, and several processes by
which elements can come into association are mentioned but not
explained. It is also limited in scope: the discussion is focused on
the associative substrate required for creativity, and does not
include description of the control processes, e.g., personality and
motivational variables, that must influence whether and how creativity
is manifested in a problem situation. Nonetheless, it has been highly
influential, since it provides a link between a highly complex
phenomenon and simpler, ostensibly better understood, associative
processes.

Two kinds of creativity measures have been rationalized in terms 
of this scheme. One of these, Mednick’s Remote Associates Test 
(RAT), provides the subject with three words and requires that 
he find an additional word which is associatively related to all of 
those given (Mednick, 1962). For example, he is given surprise, 
line, and birthday; the solution word is party. Subjects presumably 
attempt to solve the problem by scanning their networks of associa- 
tions to each of the problem elements and testing whether one of 
the resultant associations is common to all the networks. The cre- 
ative subject has more associations available and therefore is more 
likely to find the one which satisfies the requirement. 
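
The scanning process attributed to the subject can be illustrated with a short sketch. It treats each problem word as having a set of associates and looks for a word common to all three sets. The association lists and the function name `common_associates` are hypothetical and serve only as illustration; only the three cue words and the solution word come from the example above.

```python
# Hypothetical association networks; only the cue words and the
# solution word "party" are taken from the example in the text.
ASSOCIATIONS = {
    "surprise": {"party", "gift", "shock", "attack"},
    "line": {"party", "straight", "fishing", "telephone"},
    "birthday": {"party", "cake", "candle", "gift"},
}


def common_associates(cue_words, associations):
    """Return the associates shared by every cue word.

    A cue word missing from the table contributes an empty network,
    so the result is then the empty set.
    """
    networks = [associations.get(word, set()) for word in cue_words]
    if not networks:
        return set()
    return set.intersection(*networks)


solution = common_associates(["surprise", "line", "birthday"], ASSOCIATIONS)
print(solution)
```

Under this view a subject with larger association sets has a better chance that the intersection is non-empty, which is the assumption the present study puts to the test.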

Wallach and Kogan (1965) measured the extensiveness of the 
associative repertoire much more directly. Their subjects were asked 
for all the ideas they could give that met a simple problem require- 
ment; for example, to name uses for an object, such as a shoe. 
Here the creativity measures were the number of relevant ideas,
and the number of such ideas which were unique, given to each
task. They noted that these two measures were likely to be related
to one another: "... it is quite possible that more frequent
associations will occur earlier and more unique associations later in a
sequence, so that individuals who are able to produce a larger
number of associations also should be able to produce a greater
number of unique ones" (Wallach and Kogan, 1965, p. 14).

The two types of tests differ in that the RAT is convergent in 
form, requiring the production of a single predetermined solution 
to each problem, while the Wallach-Kogan measures are divergent, 
requiring many solutions. Nonetheless, both are rationalized as tests 
of the size and scope of the supply of associations the subject is
able to generate given a simple problem; they differ in the directness
of the test of this supply, not in the hypothesized continuum of
individual differences that is under examination. It is worth
examining, therefore, whether the two kinds of performance are
related to one another. If they are substantially intercorrelated, it
would provide evidence that the number of associations available
is indeed an important individual difference variable underlying
performance on the Remote Associates Test. If not, Mednick's
explanation of RAT performance, though not necessarily the
usefulness of the test, can be called into question.

The present study tested the relationship of the two kinds of
measures in fourth through sixth grade children. The Wallach-Kogan
measures were designed for children of this age and required
no important modification. The Remote Associates Test, however,
was intended for adults. Two equivalent forms of the test were
developed for use in this study, using some items taken from an
unpublished children's version of the Remote Associates Test
(Mednick and Mednick, 1962), plus a number of new items. Half the
items in each form were presented as in the adult versions of the
task, while half were given in a recognition format: each could
be answered with one word from a list printed at the bottom of the
test form. Use of this format served two purposes. First, it helped
to assure that at least one part of the test would be of an
appropriate difficulty level for children at each of the grade levels tested.
Second, if both kinds of items should fall at a reasonable difficulty
level for the children in this study, the interrelation of the two
parts of the test would provide a further test of the degree to which
the number of associates the subject has available is the crucial
factor determining his level of performance. A recognition format
should eliminate any differences dependent on the efficiency of
memory search (McCormack, 1972), making the possession of the
associative link and the ability to evaluate correctly its relevance
the sole requirements for correct performance.


Method 


Subjects 

Subjects were the 65 children, 26 males and 39 females, in one 
fourth, one fifth, and one sixth grade class of a predominantly black
urban elementary school. Fourth grade Lorge-Thorndike IQ's, avail- 
able on approximately half the sample, averaged 90.5 (SD = 13.2). 


Measures 

Modified versions of two of the creativity measures developed
by Wallach and Kogan (1965), the Uses and Pattern Meanings
tests, were employed. In the first of these the child was asked to
name uses for a common object; in the second, he gave possible
interpretations of a simple abstract pattern. Each test consisted of
an example, followed by four test items; each test item was
presented on a separate ruled page in a test booklet.

Two twenty-item forms of the Remote Associates Test were also 
employed. Each item consisted of three words, all associatively re- 
lated to the same fourth word. Items were randomly assigned to 
forms and to order within forms. After instructions and four
examples, the child was given one page containing 10 items in a
recognition format. The 15 words listed at the bottom of the page
included the answers to all 10 of these items. On a second page
were presented 10 more items on which the child had to generate
his own answers.

Procedure 


The tests were administered to intact classes in two sessions 
during the same week late in the school year. In Session 1, subjects 
were given the Pattern Meanings Test and then one form of the 
Remote Associates Test. In Session 2, they were given the Uses 
Test, followed by the second form of the Remote Associates Test. 
All testing was conducted by the same female research assistant;
a male aide was present during the sessions, and the teacher
sometimes remained in the room.

Administrative details were kept similar for all measures, so as
to avoid, so far as possible, the introduction of method differences
into the comparison of convergent and divergent measures. On the
two divergent tests, labeled "What can you use it for?" and "What
could it be?", the tester read through the instructions with the
subjects, presenting an example item and eliciting responses from the
class. The subjects then wrote down their ideas for each item. They
were given five minutes per item, a time limit which was generous
for most subjects. Children were told not to worry about spelling;
the tester and the aide were available to help with wording if needed.
The general testing atmosphere was businesslike: children were
kept to the task, but with as little emphasis on time limits or on
the evaluative aspects of the situation as was feasible.

Instructions and examples for the convergent measures, labeled
"Related Words," were also read through by the tester and the
subjects. Each item was then read aloud by the tester; the child
had one minute to find or generate the answer and write it in a
blank next to the three given words.

Scoring

The Remote Associates Test items were initially scored according
to a key containing the intended correct answers. Two judges
then examined those answers that had been scored wrong and,
in a few cases, agreed that an answer not on the list was acceptable.
The Uses and the Pattern Meanings tests were scored for number
of ideas: the total number given, less only repetitions,
incomprehensible responses, and those judged to be inappropriate. These tests
are generally also scored for uniqueness, the number of acceptable
ideas which are given by only one child in the sample; but in previous
studies uniqueness and number of ideas have been so highly
correlated, frequently in the .80's for the two scores derived from
the same test, that this score appears to provide little additional
information (Ward, 1968, 1969).
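
The two scores described above can be sketched as follows. This is a minimal illustration, not the scoring procedure actually used: it assumes the judges' screening has already removed incomprehensible and inappropriate responses, and it treats two ideas as the same only when their text matches after trimming and lower-casing, whereas a human scorer would judge equivalence. The function name `score_divergent_item` and the sample responses are hypothetical.

```python
from collections import Counter


def score_divergent_item(responses):
    """Score one divergent item for number of ideas and uniqueness.

    `responses` maps a child identifier to the list of acceptable ideas
    that child wrote. Returns, per child, the number of distinct ideas
    and how many of them were given by no other child in the sample.
    """
    # Remove repetitions within each child's own list, keeping order.
    distinct = {
        child: list(dict.fromkeys(idea.strip().lower() for idea in ideas))
        for child, ideas in responses.items()
    }
    # Count how many children gave each idea.
    givers = Counter(idea for ideas in distinct.values() for idea in ideas)
    return {
        child: {
            "number": len(ideas),
            "unique": sum(1 for idea in ideas if givers[idea] == 1),
        }
        for child, ideas in distinct.items()
    }


# Hypothetical responses to "What can you use it for?" (a shoe).
sample = {
    "child_a": ["wear it", "throw it", "Wear it"],
    "child_b": ["wear it", "doorstop"],
}
scores = score_divergent_item(sample)
print(scores)
```

Because a child who lists more ideas has more opportunities to produce one that nobody else gave, the two counts tend to rise together, which is consistent with the high correlations cited above.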


Results 


Both the divergent and the convergent creativity measures showed 
substantial increases in mean level of performance from the fourth 
grade class to the two older ones, with little difference appearing 
between the latter two. Correlations among the measures were com- 
puted within each class and then averaged over classes, using 
Fisher’s r-to-z technique. Correlations were also computed on scores 
standardized on class means and standard deviations; these
coefficients did not differ systematically from those reported below,
and are not presented.
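
The averaging of within-class correlations by Fisher's r-to-z technique can be sketched as below. Each r is transformed to z = atanh(r), the z values are averaged, and the mean is transformed back with tanh. The text does not say whether the classes were weighted, so the unweighted mean is the default here and the n - 3 weighting is offered only as an optional assumption. The three within-class coefficients are hypothetical.

```python
import math


def average_correlation(rs, ns=None):
    """Average correlations through Fisher's z transformation.

    If `ns` (sample sizes) is given, each z is weighted by n - 3;
    otherwise all coefficients receive equal weight.
    """
    zs = [math.atanh(r) for r in rs]
    if ns is None:
        weights = [1.0] * len(zs)
    else:
        weights = [n - 3 for n in ns]
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return math.tanh(mean_z)


# Hypothetical correlations from a fourth, fifth, and sixth grade class.
within_class_rs = [0.60, 0.70, 0.70]
pooled = average_correlation(within_class_rs)
print(round(pooled, 3))
```

Averaging in the z metric rather than averaging the raw coefficients avoids the bias that arises because the sampling distribution of r is skewed when the population correlation is far from zero.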
In Table 1 are shown the intercorrelations of the two divergent
creativity measures, the two forms of the convergent creativity
measure, fourth-grade Lorge-Thorndike IQ, and the composite score
from the preceding spring's administration of the Iowa Tests of
Basic Skills; IQ and achievement data were unavailable for some
subjects. Similar matrices of correlations were also calculated
separately for each sex; no systematic differences were found. Each
type of creativity measure possessed a high degree of reliability
across alternative tests. The two divergent measures, number of
ideas on the Uses and on the Pattern Meanings tests, intercorrelated
.67, while the two forms of the Remote Associates Test had an
intercorrelation of .82 (p < .001 in each case). However, the two types
of test had only a minimal relation to one another; their
intercorrelations ranged from .19 to .34, with three of the four coefficients
significant at the .05 level.


TABLE 1
Correlations among Creativity and Ability Measures

                                Pattern        Remote         Remote
                                Meanings       Assoc. A       Assoc. B       IQ               Achievement
                                r       N      r       N      r       N      r         N      r       N
Uses: Number                    .67***  65     .32*    65     .19     65     [illeg.]  32     .25     42
Pattern Meanings: Number                       .33*    65     .34*    65     .39*      32     .41*    42
Remote Associates: Form A                                     .82***  65     [illeg.]  32     .64***  42
Remote Associates: Form B                                                    .50**     32     .62***  42

*p < .05. **p < .01. ***p < .001.

In fact, the convergent and divergent measures shared little 
variance that was not also shared with IQ and achievement scores.
Achievement scores may be the better indication of general ability
level in this sample. They are more recent, the achievement tests
having been given one year before the present testing, while IQ was
tested while the child was in the fourth grade. Achievement had a
moderate positive relation to divergent creativity measures (r's of
.25, n.s., and .41, p < .05), but a strong correlation with convergent
creativity (r's of .64 and .62, p < .001).

In Table 2 are shown the partial correlations among divergent
and convergent creativity measures with achievement held
constant. While all the correlations in the matrix were somewhat
reduced by the removal of achievement variance, each type of
creativity measure continued to show substantial internal
consistency (p < .001); and the correlations between divergent and
convergent creativity were reduced to negligible magnitude. A
similar analysis was done, partialling out IQ rather than
achievement, for the 32 students having complete creativity and IQ data.
As before, correlations within divergent and convergent creativity
remained high (.75 and .73, respectively; p < .001), while
correlations between these two types of measures all failed to achieve
statistical significance (average r = .17; range from .03 to .31).


TABLE 2
Correlations among Creativity Measures with Achievement Held Constant

                              Pattern      Remote       Remote
                              Meanings     Assoc. A     Assoc. B
Uses: Number                  .64*         .20          .03
Pattern Meanings: Number                   .09          .12
Remote Associates: Form A                               .70*

N = 42 subjects with complete data on all the above measures.
*p < .001.
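
The partialling step can be sketched with the standard first-order partial correlation formula, r_xy.z = (r_xy - r_xz r_yz) / sqrt((1 - r_xz^2)(1 - r_yz^2)). The inputs below are the zero-order coefficients quoted in the Results text. They rest on larger samples than the 42 subjects of the partial analysis and were averaged over classes, so the sketch is an approximate illustration and not a reproduction of the authors' computation.

```python
import math


def partial_correlation(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1.0 - r_xz ** 2) * (1.0 - r_yz ** 2))
    return numerator / denominator


# Uses with Pattern Meanings (.67), each with achievement (.25, .41).
divergent = partial_correlation(0.67, 0.25, 0.41)

# Remote Associates Form A with Form B (.82), each with achievement (.64, .62).
convergent = partial_correlation(0.82, 0.64, 0.62)

print(round(divergent, 2), round(convergent, 2))
```

With these inputs the two within-type coefficients stay high after achievement is removed, in line with the pattern reported for the partial correlations.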

Product moment correlations among the recognition and
production parts of the Remote Associates Test were also obtained.
Within each form of the test, these two parts were highly correlated,
with r's of .63 for one form and .71 for the other. Across forms,
the recognition scores correlated .65 and the production scores
correlated .58 (all p's < .001). The two kinds of items, finally, showed
equivalent relations to standardized achievement scores; for the
sum of the scores on the recognition items over the two forms of
the test, the correlation with achievement was .65; while for the
sum over production items, it was .64 (p < .001). Thus, there is
no indication that the two kinds of items required different abilities
from the subject.

Discussion 

It has been a common problem in creativity research that one 
investigator's measure of creativity turns out to be unrelated to 
another's. To some extent, this problem represents differences in 
the choice of the level at which creativity is operationalized
(Taylor, 1959). The two types of measures studied here, however,
have been presumed to be measures not only at the same level, but
of the same process variable: the number of relevant associations
the subject has available in simple problem situations. The
Wallach-Kogan measures provide a direct assessment of this variable; and
in this study, as in earlier work, these measures proved to possess
both substantial reliability across alternative tests and
discriminability from general intelligence and achievement measures.²

Remote Associates performance shared little variance with the
divergent creativity measures, and therefore appears, contrary to
Mednick's rationale for the test, to depend on variables other than
the size of the associative repertoire. Moreover, the correlational
similarity between recognition and production scores suggests that
factors associated with the speed or efficiency of memory search
for the relevant associate are not critical for performance. One
possible contributor to test performance might be individual
differences in evaluative abilities (Frederiksen and Messick, 1959;
Guilford, 1956): as well as possessing the appropriate association,
the subject must be able to decide that it is indeed appropriate.

² Wallach and Kogan (1965) argued the importance of an evaluation-free
testing context for creativity assessment. A definitive test of this proposition
has not been made; however, the present results, along with data presented
by Ward (1971), suggest that a group testing situation in which time limits
are ample and evaluational cues are minimized is adequate for creativity
assessment.

In the present data, whatever abilities are responsible for Remote 
Associates performance appear also to contribute to IQ and achieve- 
ment test scores. Similar findings have been presented in work with 
older children (Belcher and Davis, 1971; Warren, 1971) and with 
adults (Laughlin, 1967). A few studies with adults have found the
test to measure something more than general intelligence; for
example, showing a positive relation to incidental learning (Laughlin,
1967; Mendelsohn and Griswold, 1966). With children, however,
it remains to be demonstrated that Remote Associates represents
more than an unusual approach to the measurement of general
intellectual ability.


LONGITUDINAL STUDIES OF RISK TAKING ON
OBJECTIVE EXAMINATIONS


MALCOLM J. SLAKTER AND KEVIN D. CREHAN*
State University of New York at Buffalo 


ROGER A. KOEHLER 
University of Nebraska 


Risk taking on objective examinations (rtooe) is defined as 
guessing when the examinee is aware of a penalty for incorrect 
responses (Slakter, 1967). Since rtooe measures can be obtained 
from Ss ostensibly taking aptitude or achievement tests, they pro- 
vide psychologists with useful disguised measures of risk taking. 
Prior studies have indicated that rtooe is related to dominance- 
submission (Votaw, 1936), maladjustment (Sherriffs and Boomer, 
1954), vocational choice (Ziller, 1957), curriculum choice (Slakter 
and Cramer, 1969), and perception of risk in military situations 
(Torrance and Ziller, 1957). Furthermore, it has been demon- 
strated that examinees low in rtooe tend to be penalized on test 
score (Hammerton, 1965; Sherriffs and Boomer, 1954; Slakter, 
1968a; Slakter, 1968b; Votaw, 1936). These latter studies have
shown that when Ss low in rtooe are forced to respond to all items,
their average test score increases even though the usual penalty
for incorrect responses is applied. Hence, we have evidence that
rtooe confounds the aptitude or achievement being measured by
the examination, and therefore rtooe concerns individuals involved
with educational measurement.
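
Why a low rtooe examinee can lose points may be illustrated with a short sketch. It assumes, as the text does not specify, that the "usual penalty" is the conventional correction for guessing, in which the score equals the number right minus the number wrong divided by one less than the number of answer choices. Under that assumption, attempting an item raises the expected score whenever the examinee's chance of being right exceeds pure chance. The function names and probabilities are illustrative only.

```python
def formula_score(n_right, n_wrong, n_choices):
    """Corrected score: rights minus wrongs / (choices - 1). Omits count zero."""
    if n_choices < 2:
        raise ValueError("an item needs at least two answer choices")
    return n_right - n_wrong / (n_choices - 1)


def expected_gain_per_item(p_correct, n_choices):
    """Expected score change from attempting an item instead of omitting it."""
    if n_choices < 2:
        raise ValueError("an item needs at least two answer choices")
    return p_correct - (1.0 - p_correct) / (n_choices - 1)


# Two-choice ("same" / "opposite") items, as in a synonym-antonym test.
blind_guess = expected_gain_per_item(0.5, 2)        # no knowledge at all
partial_knowledge = expected_gain_per_item(0.7, 2)  # hypothetical 70% chance

print(blind_guess, round(partial_knowledge, 2))
print(formula_score(n_right=6, n_wrong=2, n_choices=2))
```

On this reading, an examinee who omits items on which he has some partial knowledge forgoes a positive expected gain, which would be consistent with the score increases observed when such examinees are forced to respond.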


Past studies of the relationship of risk taking to age or sex have
all been cross-sectional in nature. For example, Wallach and Kogan
(1961) compared older subjects (mean age approximately 70) with
college students on a hypothetical choice dilemmas instrument.
They found that the older subjects were more conservative. In
studies of six to ten year old children, Kass (1964) found no age
difference in gambling with pennies in a slot machine. However,
Cohen (1960, chapter 5) examined nine-, twelve-, and fifteen-year
old children in a risk taking situation with candy for prizes. He
found that the nine year old children took greater risks than the
12 year olds, who in turn took greater risks than the 15 year olds.
Finally, in a cross-sectional study of rtooe in grades five through
eleven (Slakter, Koehler, Hampton, and Grennell, 1971), results
demonstrated that students in grades five, six, and seven displayed
higher risk taking than students in grades eight through eleven.

Cross-sectional studies examining the relationship of sex and risk 
taking have reported conflicting findings. For example, Wallach 
and Kogan (1959, 1961) have found no consistent sex differences 
in risk. Kass (1964), however, reported that boys selected greater 
risks than girls with the slot machines. On the other hand, using 
a decision-making task with candy as the prize, Slovic (1966) 
found a sex by age interaction. His results indicated no sex dif- 
ferences in younger children from ages six to ten, but with eleven- 
to sixteen-year olds greater risk was manifested by the boys. Spe- 
cifically with rtooe, no sex differences were found in college students 
(Slakter, 1967; Slakter and Cramer, 1969), eighth grade females 
were found to be higher in rtooe than eighth grade males (Slakter, 
1969), but the opposite was found for ninth grade students (Swine- 
ford, 1941). In a study of grades five through eleven, Slakter et al. 
(1971) reported a weak and inconsistent relationship between sex 
and rtooe. 

All cross-sectional studies suffer from limitations (see Hilton and
Patrick, 1970) due to possible cohort differences (i.e., different
populations at the different grade levels) and cohort changes (e.g.,
dropouts). Therefore, it was decided to conduct longitudinal studies
of the relationship between rtooe and age. The present study
examines longitudinal data providing information on (a) the
relationship between rtooe and age, (b) the age by sex interaction,
and (c) the stability of rtooe.


Method 


The measure of rtooe was based upon the use of nonsense items,
where a nonsense item is defined as one that has no correct (or
best) answer, and no incorrect answer for the given population.
Previous research (Slakter, 1967; Slakter, 1969; Slakter and
Cramer, 1969; Slakter and Koehler, 1968) has established that five
nonsense items embedded in five legitimate items yield
Kuder-Richardson formula 20 (KR-20) reliabilities in the vicinity of .80.
In addition, since there is evidence that rtooe is a general trait
across different types of examinations (Slakter, 1969), convenient
synonym-antonym vocabulary items were employed in the measure.
Ss were directed to indicate whether the words had the same or
opposite meaning, and were informed of the penalty for incorrect
responses. The following is an example of a nonsense item used in
the measure:


Since "marnel" is meaningless in the English language, the above
item has no correct or best answer. Hence, any response (i.e.,
"same" or "opposite") is assumed to be an example of rtooe
behavior; if the item is omitted, a lack of rtooe behavior is indicated.
In order that age differences could be examined, the nonsense items 
which formed the basis of the rtooe measure were constructed so 
that they could be used at all grade levels; the legitimate items 
were selected to be appropriate for the particular grade level, and 
usually appeared at a single grade level. The rtooe score assigned 
to an S was the proportion of nonsense items attempted. 
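
The rtooe score defined above can be sketched in a few lines. The representation of an examinee's answers as a list, with `None` standing for an omitted item, is an assumption made for illustration; the function name `rtooe_score` is likewise hypothetical.

```python
def rtooe_score(nonsense_responses):
    """Proportion of nonsense items attempted.

    Each entry is the examinee's answer to one nonsense item
    ("same" or "opposite"), or None if the item was omitted.
    """
    if not nonsense_responses:
        raise ValueError("at least one nonsense item is required")
    attempted = sum(1 for response in nonsense_responses if response is not None)
    return attempted / len(nonsense_responses)


# An examinee who answers three of the five nonsense items.
example = rtooe_score(["same", None, "opposite", "same", None])
print(example)
```

With five nonsense items the score can take only the six values 0, .2, .4, .6, .8, and 1.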

Ss for the study were all available public school students in
grades five through eleven in a large village in western New York 
State. For the first testing in 1968 there were a total of 1,070 Ss, 
consisting of 522 males and 548 females. The number in each grade 
varied from 118 to 228. The tests were administered to the Ss in 
their own classrooms by their own teachers. The teachers were in- 
structed as to standardized procedures of administration. The Ss 
were generally led to believe that they were taking another aptitude 
examination in their school’s testing program. The tests were given 
as Part I of an “aptitude” examination on the same day to all 
classes in a given school, and within several days to the entire 
school system. The same procedure was repeated in 1970 to collect
data after the passage of two years' time. At this second testing,
there were 1,049 Ss, with 536 males and 513 females. The number
in each grade varied from 110 to 190.

We can classify the data into four sets, which are not necessarily 
mutually exclusive. We have (a) a set of cross-sectional data from 
1968 (described previously in Slakter, et al. [1971]) designated as 
68X, (b) a set of cross-sectional data from 1970 designated as 70X, 
(c) a set of unmatched longitudinal data which includes all the 
students tested each time and symbolized UL, and (d) a set of
matched longitudinal data which involves only those students tested
in both 1968 and 1970, symbolized ML.


Results and Discussion 


Table 1 presents the KR-20 reliabilities for the rtooe measure. 
The values for the 68X data ranged from .68 to .86 with a median
of .83; the values for the 70X data varied from .69 to .86 with a
median of .82. Hence, the five-item rtooe measure had compara- 
tively high internal consistency for grades five through eleven in 
both test administrations. 
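
For readers unfamiliar with the coefficient, a KR-20 computation on a five-item attempted/omitted matrix can be sketched as follows. The formula is KR-20 = (k / (k - 1)) (1 - sum(p q) / variance of total scores), where k is the number of items and p is the proportion of examinees scoring 1 on an item. The data are hypothetical, and the use of the population (divide by n) variance is an assumption; some treatments divide by n - 1.

```python
def kr20(item_matrix):
    """Kuder-Richardson formula 20 for dichotomous (1/0) items.

    Rows are examinees and columns are items; here 1 means the
    nonsense item was attempted and 0 means it was omitted.
    """
    n_people = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(row) for row in item_matrix]
    mean_total = sum(totals) / n_people
    # Population variance of total scores (assumption: divide by n).
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    if var_total == 0:
        raise ValueError("total scores have zero variance")
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_matrix) / n_people
        sum_pq += p * (1.0 - p)
    return (n_items / (n_items - 1)) * (1.0 - sum_pq / var_total)


# Hypothetical data: four examinees by five nonsense items.
data = [
    [1, 1, 1, 1, 1],
    [1, 1, 1, 0, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
reliability = kr20(data)
print(round(reliability, 2))
```

A high value arises when examinees who attempt one nonsense item tend to attempt the others as well, which is the sense in which the measure is internally consistent at a single testing.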

The mean rtooe scores for the cross-sectional and longitudinal
data are presented by sex within grade in Table 2. The numbers in
parentheses are the sample sizes. (The numbers of Ss in grades
nine through eleven exceed the numbers in grades five through
eight because of transfers into the system at the ninth grade. This
cohort change represents more of a problem with interpretation of 
cross-sectional data than longitudinal data.) Note the close sim- 
ilarity between the cross-sectional and matched longitudinal means 
for both the 68 and 70 samples. This similarity indicates little cause 
for concern due to bias from selection or nonrandom loss of subjects. 

TABLE 1
KR-20 Reliabilities for Cross-Sectional Data

Grade    5      6      7      8      9      10     11
68X     .78    .68    .85    .76    .86    .85    .83
70X     .82    .76    .82    .69    .86    .84    .84


TABLE 2
Mean rtooe by Sex within Grade (Sample Size in Parentheses)

Grade  Sex   68X          70X          68ML         70ML
5      m     .92 (75)     .78 (63)     .93 (57)
       f     .85 (60)     .67 (60)     .86 (42)
6      m     .92 (63)     .92 (80)     .92 (50)
       f     .87 (56)     .93 (64)     .89 (43)
7      m     .93 (59)     .76 (72)     .93 (50)     .74 (57)
       f     .96 (71)     .85 (52)     .95 (53)     .87 (42)
8      m     .79 (48)     .66 (60)     .81 (37)     .66 (50)
       f     .73 (70)     .63 (50)     .73 (59)     .63 (43)
9      m     .72 (99)     .79 (94)     .72 (63)     .85 (50)
       f     .71 (129)    .82 (96)     .69 (85)     .86 (53)
10     m     .76 (85)     .69 (80)                  .71 (37)
       f     .68 (80)     .60 (95)                  .60 (59)
11     m     .61 (93)     .71 (87)                  .73 (63)
       f     .66 (82)     .65 (96)                  .66 (85)

Table 3 provides the mean rtooe differences over the two year
time period by sex. The first column was calculated by subtracting
68X rtooe means (Table 2) separated by two year intervals; e.g.,
the difference for males from grades five to seven (.01) was found
by subtracting the 68X mean for fifth grade males (.92) from the
68X mean for seventh grade males (.93). The second column was
found in similar fashion from the 70X means. The values in the
third column (ML) were found by subtracting the mean in the
68ML column in Table 2 from the appropriate mean in the 70ML
column; e.g., the change for females eighth to tenth grade (-.13)
was calculated by subtracting the 68ML mean for eighth grade
females (.73) from the 70ML mean for tenth grade females (.60).
Entries in the last column were calculated by subtracting the 68X
mean from the appropriate mean in the 70X data; e.g., the mean
difference for females in the ninth to eleventh category (-.06)
resulted from the subtraction of the 68X mean for the ninth grade
females (.71) from the 70X mean for the eleventh grade females
(.65). Note that this last column provides the mean differences for
the unmatched longitudinal data.
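
The subtraction rule just described can be sketched as follows, using only the six means quoted in the worked examples above. The dictionary layout and the function name `two_year_change` are illustrative assumptions.

```python
# Means quoted in the text; keys are (data set, grade, sex).
MEANS = {
    ("68X", 5, "m"): 0.92,
    ("68X", 7, "m"): 0.93,
    ("68ML", 8, "f"): 0.73,
    ("70ML", 10, "f"): 0.60,
    ("68X", 9, "f"): 0.71,
    ("70X", 11, "f"): 0.65,
}


def two_year_change(earlier_set, later_set, start_grade, sex, means=MEANS):
    """Later mean (two grades up) minus earlier mean, rounded to two places."""
    earlier = means[(earlier_set, start_grade, sex)]
    later = means[(later_set, start_grade + 2, sex)]
    return round(later - earlier, 2)


# Cross-sectional (68X column): both means come from the 1968 testing.
cross_sectional_68 = two_year_change("68X", "68X", 5, "m")

# Matched longitudinal (ML column): the same students in 1968 and 1970.
matched = two_year_change("68ML", "70ML", 8, "f")

# Unmatched longitudinal (UL column): all students tested each time.
unmatched = two_year_change("68X", "70X", 9, "f")

print(cross_sectional_68, matched, unmatched)
```

The cross-sectional columns compare different students tested in the same year, whereas the ML and UL columns follow a grade cohort across the two years; this is why the columns need not agree.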

From an inspection of Table 3 we see that the mean rtooe 
changes for the cross-sectional data differ from those of the longi- 
tudinal data, both matched and unmatched. For example, in the 
cross-sectional data we find essentially no age change in rtooe over 
the eighth to tenth grade period, whereas the longitudinal data pro- 
vide evidence of a decrease in rtooe for this grade interval. These 
differences between the cross-sectional and longitudinal mean
changes may be attributed to cohort differences or to cohort changes
in the cross-sectional data.


TABLE 3

Grades     Sex   68X        70X        ML         UL
5 to 7     m      .01       -.02       -.19       -.16
           f     [illeg.]    .18        .01        .00
6 to 8     m     -.18       -.26       -.26       -.26
           f     -.14       -.30       -.26       -.24
7 to 9     m     [illeg.]    .03       -.08       [illeg.]
           f     -.25       -.02       -.10       [illeg.]
8 to 10    m     -.03        .03       -.10       -.10
           f     -.05       -.03       -.13       [illeg.]
9 to 11    m     -.11       -.07       [illeg.]   [illeg.]
           f     [illeg.]   [illeg.]   [illeg.]   -.06

Since the matched longitudinal data provide information on
changes in the same people over a two year period, we assume
that these data are the most informative. Repeated measures
analyses at the .05 level on the ML data indicate: (a) a significant
decrease in mean rtooe for the periods sixth to eighth grade, seventh
to ninth grade, and eighth to tenth grade, (b) a significant sex
by age interaction for the fifth to seventh grade time period, with
the males displaying a significant decrease in rtooe, but the females
remaining constant, and (c) no difference in rtooe for the ninth
to eleventh grade period.

It was conjectured that perhaps the education process accounts
for the decrease in mean rtooe by teaching the high risk takers to
become more conservative. This conjecture was investigated by ex- 
amining the bivariate scatterplots of the test-retest data for grades 
6 to 8, 7 to 9, and 8 to 10. If the conjecture were true, we would 
expect to observe that many Ss high on the first administration 
would be low on the second. An examination of the scatterplots 
failed to uncover this relationship. Therefore, no evidence was 
found to indicate that the decrease in mean rtooe is caused by the 
education process. Perhaps, as Cohen (1960, p. 110) says: “... a
relatively higher proportion of the younger ones may prefer to 
gamble because their hope (or psychological probability) of winning 
the prize is relatively greater than those of the older children.” We 
will shortly consider the possibility that the decrease in mean rtooe 
may be attributed to developmental change. 

To summarize the analyses of the ML data, there was a strong 
tendency for Ss to decrease in mean rtooe over grades 6 to 8, 7 to 9, 
and 8 to 10. There appeared to be little evidence for an age by 
sex interaction, except at the fifth to seventh grade period, and no 
evidence for sex differences in mean rtooe. 

Test-retest reliabilities (stabilities) over the two year period for
the rtooe measure were calculated from the ML data and are
presented in Table 4. Note that the rtooe measure is extremely
unstable for males until the ninth to eleventh grade period, while with
the females the rtooe measure becomes somewhat stable at the sixth
to eighth grade period and quite stable from the ninth to eleventh
grade. The high KR-20's that we found with the cross-sectional data
(Table 1) together with these low test-retest reliabilities indicate
that rtooe tends to be a temporary characteristic of male students
in grades five through nine (and perhaps longer), and in females
from grades five through eight. However, rtooe appears to be a
lasting characteristic for females from grade nine to eleven.


TABLE 4
Test-retest Reliabilities (Stabilities) over Two Year Period (ML Data)

                      Grade period
Sex       5 to 7     6 to 8     7 to 9     8 to 10    9 to 11
male      .08        .10        .20*       -.32*      .38*
female    [illeg.]   .94*       .34*       .38*       .72*

* Significant at .05 level.
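
A test-retest stability of this kind can be sketched as a product moment correlation computed over only those students present at both testings, which is what defines the ML data. The student identifiers and scores below are hypothetical, and the helper names `matched_pairs` and `pearson_r` are illustrative; the sketch is not the authors' computation.

```python
import math


def matched_pairs(scores_first, scores_second):
    """Keep only students tested on both occasions, in a common order."""
    ids = sorted(set(scores_first) & set(scores_second))
    return [scores_first[i] for i in ids], [scores_second[i] for i in ids]


def pearson_r(x, y):
    """Product moment correlation between two equal-length score lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length, at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores must vary on both occasions")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical rtooe scores keyed by student identifier.
scores_1968 = {"s1": 1.0, "s2": 0.8, "s3": 0.6, "s4": 1.0, "s5": 0.4}
scores_1970 = {"s2": 0.8, "s3": 0.4, "s4": 0.6, "s5": 0.6, "s6": 0.2}

first, second = matched_pairs(scores_1968, scores_1970)
stability = pearson_r(first, second)
print(len(first), round(stability, 2))
```

A measure can show high internal consistency at each testing and still yield a low coefficient here, since the two statistics answer different questions: agreement among items on one day versus agreement of the same student's scores two years apart.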


In observing that mean rtooe tends to stabilize (Table 3) at
about the same time that rtooe appears to become a more lasting
characteristic (Table 4), one might speculate that Ss do not have
a developed concept of risk-taking on objective examinations until
the ninth grade. Piaget and Inhelder (1951) have suggested that
the probability concept is not acquired until after age eleven.
While other researchers have provided evidence that the probability
concept may be acquired before age eleven (e.g., Yost, Siegel, and
Andrews, 1962), Hale, Miller, and Stevenson (1967) found that 
mastery of more sophisticated applications of probabilistic concepts 
may not appear until grade 9. Therefore, it seems reasonable to 
assume that any risk-taking characteristic might not become stable 
until the subjects had first mastered the concept of probability. 

Psychologists interested in using rtooe as a disguised measure of
risk-taking need to keep in mind that rtooe appeared to be a stable
characteristic for females only for the ninth to eleventh grade
period. For males, rtooe was unstable over all grade intervals
studied. Individuals involved in educational measurement need to
consider that: (a) whereas a particular student tends to be quite
consistent in rtooe at a given point in time, rtooe tends to be
unstable until at least grade 9; hence, the high rtooe student who is
not penalized in an early grade (say sixth) may decrease in rtooe
and be penalized in a later grade (say eighth) and vice versa,
(b) test-retest stability for aptitude and achievement scores will
tend to be lowered by the lack of stability in rtooe, (c) the rtooe
strategy in terms of maximizing average test score becomes poorer
over grades five to nine, and (d) whatever rationale "do not guess"
directions have, their use before grade nine may be difficult to
defend since there is some doubt that students have mastered the
concept of probability as it relates to rtooe.

In summary, mean rtooe tends to decrease over grades five to
nine, after which it appears to remain steady, at least until grade
eleven. There was no evidence of a relationship between sex and
rtooe, and evidence for an age by sex interaction only at the grade
five to seven period. Finally, there is evidence that rtooe is a
temporary characteristic in the early grades, and does not become
lasting for females until the nine to eleven grade period, and does
not become lasting for males over the grade intervals studied.


THE EFFECT OF DOUBLE STANDARDIZED
SCORING ON THE SEMANTIC DIFFERENTIAL


JACK R. HAYNES 
North Texas State University 


ized scoring and from factor scores based on these two scoring 
methods. Groups were formed from these profiles by Ward's
Hierarchical Grouping Technique. Comparisons of the group 
memberships also demonstrated that the different scoring methods 
reveal different patterns. The major difference between the two 
methods appeared to be due to interindividual variability being 
associated with regular scoring while double standardized scoring 
reveals intraindividual differences. 


ALTHOUGH the semantic differential has been used in a great
number of studies, most of these have utilized this technique to
measure various theoretical concepts or to compare groups on response
patterns. A few studies have concentrated on the nature and
consistency of the factor structure (Darnell, 1966; Elliott and Tannenbaum,
1963; Howe, 1964; Levin, 1965) while others have examined
the metric properties of the semantic differential (Heise,
1969; Green and Goldfried, 1965; Messick, 1957; Miller, 1956;
Osgood, Suci, and Tannenbaum, 1957). Most references to scoring
procedures have been summarized in The Measurement of Meaning
(1957). In general, raw scores on the scales have been used and
standard R technique factor analysis has also been employed. Apparently
very little has been done with the semantic differential
in the area of grouping by profile similarity, and work on scoring
effects on factor structure has also been limited.
Broverman (1961, 1962) has stated that the nature of the factor 
analytic approach and the scoring procedure should be influenced 
by the particular theoretical interest of the investigator. Broverman 
has pointed out the differences in the factor structure when the 
standard R technique procedures are used as contrasted with a 
double standardized data matrix. The effect of double standardiza- 
tion is to remove the between person variability from the correla- 
tions and thus the factor structure would reveal intraindividual or 
intrapsychic variations while the factor structure of the raw score 
data matrix would contain between person variability. If these 
effects are true, then grouping individuals on the basis of between 
person variability and on the basis of within person variability 
should produce different group memberships. The major purpose of 
this study was to investigate the effects of these two approaches 
on factor structure and on grouping by profile similarity in the 
semantic differential. A secondary purpose of this study was to 
compare the amount of error in grouping when raw scores versus 
factor scores are used. The following hypotheses were tested: (1) 
there will be a difference between the factor structure of a raw 
score data matrix and the factor structure of a double standardized 
data matrix, (2) group membership based on profile similarity of
raw scores will be different from group membership based on profile
similarity of double standardized scores, (3) profile grouping based
on factor scores will produce less error than profile grouping on
raw scores.


Method 
Subjects 


Two hundred college students, 79 females and 121 males, from
general psychology classes were administered the semantic differential
on two concepts. The mean age was 20.14 years with a
range from 17 years to 39 years. Thirty-one major fields were
represented, with 111 freshmen, 41 sophomores, 30 juniors, 17 seniors,
and 1 graduate student.


Instruments 


Two concepts, myself and home, were scaled on 12 bipolar adjectives.
The adjectives were selected from previous lists (Osgood,
Suci, and Tannenbaum, 1957) to provide representation of the three
scales of evaluation, activity, and potency, and were scored on a seven
point scale. The adjectives were: good-bad, beautiful-ugly, clean-dirty,
large-small, heavy-light, strong-weak, active-passive, sharp-dull,
fast-slow, kind-cruel, fair-unfair, and rugged-delicate.


Procedure 


The subjects were given the two concepts to be scaled with the 
order of presentation of concepts and the sequence of bipolar ad- 
jectives being randomized. This procedure was done to reduce any 
response set, and the reason for two concepts was to investigate the 
equivalence or stability of the results. 

After the semantic differentials were completed, the following
analyses were done on each concept separately: (1) a 200 x 12
raw score data matrix was obtained, yielding the raw scores for
each individual on each of the 12 bipolar adjectives, (2) the data
matrix was then converted to a double standardized data matrix
by converting the scores in each column into z scores and then
converting these standard scores into z scores in each row, (3)
each of these matrices was factor analyzed by a principal axis
method with unities in the diagonal. A Varimax rotation was applied
to all factors with latent roots greater than one, (4) factor scores
were computed for each subject from both the raw scores and double
standardized scores, (5) Ward's Hierarchical Grouping Technique
(1963) was applied separately to each of the two sets of factor
scores. A criterion of four groups, within each scoring technique,
was arbitrarily chosen for further comparison, (6) a comparison of
group membership was done by uncertainty reduction analysis to
determine if the different methods yielded different groups based on
profile similarity.
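
Step (2) of the procedure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the population standard deviation is assumed for the z scores (the paper does not say which was used), and the small matrix is made-up data rather than the study's 200 x 12 matrix.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of numbers to mean 0 and population SD 1.

    Assumes the values are not all identical (otherwise the SD is zero).
    """
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def double_standardize(matrix):
    """Column-wise z scores first, then row-wise z scores of the result."""
    columns = [z_scores(list(col)) for col in zip(*matrix)]
    column_standardized = [list(row) for row in zip(*columns)]
    return [z_scores(row) for row in column_standardized]


# Hypothetical ratings: 4 subjects (rows) by 3 scales (columns).
raw = [[7, 5, 6], [4, 7, 5], [3, 1, 2], [1, 4, 1]]
double_std = double_standardize(raw)
for row in double_std:
    print([round(v, 2) for v in row])
```

After the second pass every row has mean 0 and unit variance, so differences in overall level between persons are removed and only the shape of each person's profile remains. The columns are no longer exactly standardized once the rows have been rescaled.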


Results 


As shown in Tables 1-4, the rotated factor structures of the raw
scores were different from those of the double standardized scores.
The results of the uncertainty reduction analysis in Table 5 reveal
very low relationships between group memberships based on
the different scoring methods.

As seen in Table 6, the least amount of group heterogeneity or
error is found when factor scores are used, with the greatest reduction
occurring from factor scores based on the factor analysis
of raw scores.




TABLE 1
Rotated Factor Matrix for Raw Scores on the Concept Myself


                               Factors
Variables              I            II           III
Good-Bad              .20          .02           .69
Beautiful-Ugly        .44         -.23           .41
Clean-Dirty           .50         -.02           .55
Large-Small           .14          .78           .11
Heavy-Light          -.18          .81       [illegible]
Strong-Weak           .27          .69           .02
Active-Passive    [illegible]      .15           .21
Sharp-Dull            .73          .08           .21
Fast-Slow             .84          .01          -.03
Kind-Cruel            .03          .10           .86
Fair-Unfair           .07          .08           .80
Rugged-Delicate       .31          .65          -.12


Discussion 


Inspection of Tables 1-4 reveals that the factor structures for the
raw scores and the double standardized scores are different. They
differ in both the number of factors and the pattern of loadings, 
which supports hypothesis number one. This finding would be in 
keeping with Broverman’s contention that two different psycho- 
logical pictures are obtained from these separate analyses. The 
analysis of the raw data yielded the usual factors of evaluation, 
potency, and activity, but when only intraindividual variability 
was analyzed, the results were different. MacAndrew and Forgy 
(1963) have criticized Broverman’s findings on the basis that 


TABLE 2
Rotated Factor Matrix for Raw Scores on the Concept Home


Factors
Variables I II III
Good-Bad 
E 8 

Beautiful-Ugly B fon 20 
Clean-Dirty 64 —02 9 
Large-Small 07 .83 .02 
Heavy-Light Ed 85 —.06 
Strong-Weak ‘За 93 68 
Active-Passive .24 102 ‘81 
Sharp-Dull .26 .02 EC 
Fast-Slow .21 .08 81 
Kind-Cruel 185 03 54 
Fair-Unfair 82 ‘01 116 


Rugged-Delicate .02 .55 .25 
TABLE 3
Rotated Factor Matrix for Double Standardized Scores on the Concept Myself


                                         Factors
Variables              I            II           III           IV            V
Good-Bad              .01         -.06          -.15       [illegible]      .17
Beautiful-Ugly    [illegible]      .14          -.78           .04         -.14
Clean-Dirty           .23         -.10           .04           .75          .18
Large-Small          -.77          .19           .18          -.10          .04
Heavy-Light          -.81          .15           .16          -.12          .08
Strong-Weak          -.11          .42           .31           .01          .57
Active-Passive        .52      [illegible]       .46           .09         -.04
Sharp-Dull            .15          .17           .19           .09         -.84
Fast-Slow             .60          .37           .08          -.14         -.26
Kind-Cruel            .02         -.78          -.02           .24          .08
Fair-Unfair           .07         -.87           .07          -.07         -.04
Rugged-Delicate       .07          .05           .20          -.82          .24


Rugged-Delicate 64 4550 00 ------ 


Broverman's method of factor extraction produced rotated factors, 
and when rotated factors extracted from the principal components
R technique were used, the results were comparable. The present
study, however, revealed distinct differences between the rotated 
factor structure of the raw scores and the double standardized 
scores even though a principal axis solution was used in both in- 
stances. 

When the usual R technique was used on the semantic differential, 
the three scales of evaluation, potency, and activity were found. 
These factors would be descriptive of the meaning of the concepts 
for individuals on a normative basis while the factors extracted 


TABLE 4
Rotated Factor Matrix for Double Standardized Scores on the Concept Home


Factors


Good-Bad —.61 .22 -.07 —.33 
Beautiful-Ugly .14 -.02 -.02 5 p 
Clean-Dirty —.21 .10 18 200 
Large-Small .45 .50 .24 oi 
Heavy-Light 44 .39 ‚34 . 2L 
Strong-Weak .03 —.35 —.68 `0 
Active-Passive .26 .19 -.80 202 
Sharp-Dull All -.15 —.13 TA 
Fast-Slow b ii —.76 .08 Ot 
Kind-Cruel -.17 е - 15 "us 
Fair-Unfair -.17 2 i HA 


Rugged-Delicate 31 .21 
TABLE 5
D Values from Uncertainty Reduction Analysis for Group Membership Similarity


Score*                    Concept
Comparison          Myself        Home
1 and 2               .25          .21
3 and 4               .15      [illegible]


* 1 = raw scores.
  2 = double standardized scores.
  3 = raw data factor scores.
  4 = double standardized factor scores.


from the double standardized scores would be descriptive of the 
meaning of the concepts on an ipsative basis. Whether individuals
score high or low on any of these latter factors would be deter- 
mined by the relationships between their standings on the other 
ipsative factors. 

The results in Table 5 also demonstrate a very low relationship 
between the profile similarities of the different scoring methods. 
This finding substantiates hypothesis number 2, for if there were 
no differences between the factor structures, then people should 
have similar profiles regardless of scoring method used, but this 
situation was not found in the present study. 

The effect of score transformation can be demonstrated in an 
example with hypothetical subjects. Table 7 contains the raw 
scores and profile difference values for four hypothetical individuals. 
The d values are the summed squared differences between indi- 
vidual profiles. If the grouping is based on these values, Ss 1 and 2 
would form one group while another group would be formed by Ss 
3 and 4. Thus the grouping is greatly influenced by normative rela- 
tionships or level of performance. This type of grouping would be 
appropriate when the investigator was interested in groups formed 
primarily from interindividual differences. When the scores in Table 
7 were converted to double standardized scores, as shown in Table 


TABLE 6
Error Magnitude within the Hierarchically Formed Groups

                                 Concept
Type of Scores              Myself      Home
Raw Data                    143.33     166.14
Double Standardized          81.31      80.28
R. D. Factor Scores          31.24      43.74
D. S. Factor Scores          77.41      75.51
TABLE 7
Raw Scores and d Values for Four Hypothetical Subjects


                 Factors                  d Values
Subjects     I     II    III        1      2      3      4
1            7     5     6                 14     48     62
2            4     7     5                        46     34
3            3     1     2                               14
4            1     4     1


8, one group would be formed by Ss 1 and 3 while Ss 2 and 4 
would form the other group. The absolute size of the d values would 
differ due to the magnitude of the measures in Table 8. The utiliza- 
tion of the double standardized scores would generate profile sim- 
ilarity based upon shape rather than level of performance. Thus 
within semantic differential data, double standardized scoring 
would yield different patterns of meaning for concepts than the raw
score profiles. 
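
The hypothetical example can be recomputed with a short Python sketch. The raw scores are those of Table 7. The double standardized scores are those of Table 8, except that the third-factor entries for Ss 3 and 4 are illegible in the available copy and are assumed here to be .3 and -.7, the values that reproduce the remaining legible d values. The function names are hypothetical.

```python
from itertools import combinations


def d_value(p, q):
    """Summed squared differences between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def d_table(profiles):
    """d value for every pair of subjects, keyed by 1-based subject numbers."""
    return {(i + 1, j + 1): round(d_value(profiles[i], profiles[j]), 2)
            for i, j in combinations(range(len(profiles)), 2)}


def nearest(profiles, subject):
    """1-based number of the subject whose profile is closest to `subject`."""
    i = subject - 1
    others = [j for j in range(len(profiles)) if j != i]
    return min(others, key=lambda j: d_value(profiles[i], profiles[j])) + 1


raw = [[7, 5, 6], [4, 7, 5], [3, 1, 2], [1, 4, 1]]            # Table 7
double_std = [[1.0, -1.4, 0.4], [-1.2, 1.2, 0.0],              # Table 8
              [1.1, -1.3, 0.3], [-0.8, 1.4, -0.7]]             # .3 and -.7 assumed

print(d_table(raw))
# {(1, 2): 14, (1, 3): 48, (1, 4): 62, (2, 3): 46, (2, 4): 34, (3, 4): 14}
print(nearest(raw, 1), nearest(raw, 3))                # level: 1 with 2, 3 with 4
print(nearest(double_std, 1), nearest(double_std, 2))  # shape: 1 with 3, 2 with 4
```

With raw scores the closest pairs are Ss 1 and 2 and Ss 3 and 4, reflecting level of performance. With the double standardized scores the closest pairs become Ss 1 and 3 and Ss 2 and 4, reflecting profile shape.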

Another approach would be to group individuals on the basis 
of both kinds of scores to achieve greater homogeneity with respect 
to level and shape. Grouping individuals on profile similarity and
then looking for other common traits has certain advantages over
the usual approach of comparing established groups on some instrument.
This latter approach often leads to considerable variability
within groups which may disguise the nature of important
characteristics, for the basis for group membership is not made on
psychometric similarity. Profile analysis, however, may reveal important
subgroupings within larger classifications.

The third hypothesis predicted less error for factor scores than
for raw scores. The error values in Table 6 support this hypothesis. 
The reason for this error reduction would be due to proper weighting 
of the variables in factor scores, and since there are fewer factors 
than variables, less cumulative error when computing difference 


TABLE 8
Double Standardized Scores and d Values on the Scores in Table 7

                 Factors                     d Values
Subjects     I      II     III        1       2        3        4
1           1.0   -1.4     .4               11.76     .03    12.29
2          -1.2    1.2     0                        11.63      .69
3           1.1   -1.3     .3                                11.90
4           -.8    1.4    -.7
values for the profiles. If the variables all had equivalent loadings
on a factor, little difference would be found between factor scores
and summing raw scores. If, however, there is considerable variability
among the factor loadings, factor scores would be more
accurate, for the appropriate contribution of each variable to the 
total would be achieved. 


Summary 
When the factor structures and profile groups for raw scores and 
double standardized scoring were compared, differences were found 
which were directly related to the scoring procedure. Analyses 
based on raw scores tend to yield results related to interindividual 
variability while double standardized scoring yielded intraindi- 
vidual or ipsative sources of variation. 


ITEM-ANALYSIS OF JOURARD'S
SELF-DISCLOSURE QUESTIONNAIRE-21 


W. BARNETT PEARCE 
University of Kentucky 


BERNIE WIEBE 
Freeman Junior College 


tween high and low disclosers. Although most disclosure was
reported about lowly intimate and least about highly intimate
items, three of the four best discriminators between high and low
disclosers were judged by Jourard to be of moderate intimacy.
Further, those items which discriminated best differed between
male and female subjects.


Most questionnaires used in self-disclosure research closely re- 
semble Jourard and Lasakow’s (1958) instrument. Reliability and 
validity data summarized by Jourard (1971) support the use of 
questionnaires to measure self-disclosure. 

One of the most convenient SDQ's is Jourard's (1971, p. 215) 
21-item instrument. The purpose of this study was to determine (a) 
the reliability and discrimination values of each item; (b) whether 
items judged highly intimate by Jourard discriminate between high 
and low disclosers more than items judged lowly intimate; and (c)
sex differences in disclosure. 


Copyright © 1975 by Frederic Kuder 
Method


Undergraduates (181 males and 150 females) at three universities
described their disclosure to their best friend, to a friend,
or to an acquaintance. Internal consistency coefficients were calculated
for male and female Ss and total N. Correlation coefficients
were computed between each item and the summated ratings for
the total instrument. The discrimination value of each item was
determined by observing the size of the t-value of the difference
between the scores of the upper and lower thirds of the sample.
Summed scores and discrimination values of each item were compared
to identify sex differences. All computations were performed
on an IBM 360/40 computer using a summated ratings scaling
program SUMRAT9 (Brickner, Daucsavage, and Kern, 1972).
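
The discrimination index described above can be illustrated with a brief Python sketch. It is not the SUMRAT9 program. It assumes that respondents are ranked on their summed score, that the upper and lower thirds are of equal size, and that a pooled-variance t is computed (the paper does not state which t formula was used). The function name and the ratings are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def discrimination_t(responses, item):
    """t for one item: upper third versus lower third on total score.

    Assumes a pooled-variance t with equal group sizes (an assumption;
    the source does not specify the t formula).
    """
    ranked = sorted(responses, key=sum)        # order respondents by summed score
    k = len(ranked) // 3                       # size of each extreme group
    low = [r[item] for r in ranked[:k]]
    high = [r[item] for r in ranked[-k:]]
    pooled = (variance(low) + variance(high)) / 2   # equal n, so a simple average
    return (mean(high) - mean(low)) / sqrt(pooled * 2 / k)


# Hypothetical ratings: 9 respondents by 3 items, each scored 0-2.
responses = [
    [0, 0, 1], [1, 0, 0], [0, 1, 1],
    [1, 1, 1], [1, 2, 0], [2, 1, 1],
    [2, 2, 1], [1, 2, 2], [2, 2, 2],
]
t_values = [discrimination_t(responses, i) for i in range(3)]
ranking = sorted(range(3), key=lambda i: -t_values[i])   # best discriminator first
print([round(t, 2) for t in t_values], ranking)
```

Ranking the items by the size of t, as in the sketch, gives the kind of ordering reported in the item rankings: the item whose extreme groups differ most relative to their variability is ranked first.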


Results 


Alpha coefficients of internal consistency (Cronbach, 1970) were
.90 for the combined sample, .88 for male and .91 for female Ss.
Pearson r's between each item and the whole test were generally
higher for items highest in discrimination power.
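
For readers who wish to see how a coefficient of this kind is obtained, the following Python sketch computes coefficient alpha from a respondents-by-items table using the standard formula, alpha = k/(k - 1) x (1 - sum of item variances / variance of total scores). The function name and the small data set are hypothetical and are not the study's data.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Coefficient alpha from a respondents x items table of scores."""
    k = len(responses[0])                                   # number of items
    item_variances = [variance(col) for col in zip(*responses)]
    total_variance = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: 9 respondents by 3 items, each scored 0-2.
responses = [
    [0, 0, 1], [1, 0, 0], [0, 1, 1],
    [1, 1, 1], [1, 2, 0], [2, 1, 1],
    [2, 2, 1], [1, 2, 2], [2, 2, 2],
]
alpha = cronbach_alpha(responses)
print(round(alpha, 2))  # 0.67
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency.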

Table 1 reports the correlation coefficients between each item and
the whole-test scores, and ranks by size of t the relative discrimination
power of each item. Means and deviations for the total test
and items at each intimacy grouping are given in Table 2.

Disclosure about items 17, 8, 5, and 3 discriminated most between
high and low disclosers. Least discrimination power was found in
items 7, 11, 13, and 19. Of the seven items in the high intimacy
group, #8 and #6 were in the upper third in discrimination power
for males, and #3 and #8 for females. Mean disclosure for high
intimacy items was lower than that for low intimacy items. For
males, 6 of the 7 low intimacy items (all but #1) were in the lowest
third in discrimination power; items 7, 11, and 13 were in the lowest
third for females. There was little difference in total disclosure for
male and female Ss.


Summary 


Reliability scores, both for the total test and for each item, are
acceptable, but Jourard's identification of intimacy levels is only
partially supported by discrimination values. Different items discriminated
high and low disclosers among male and female Ss,
although total disclosure was comparable. These data suggest that
this SDQ is an acceptable instrument but that intimacy designations
do not necessarily identify items which distinguish high and
low disclosers. Items in the upper third in discrimination value for
both male and female Ss involved regretted acts, feelings of maladjustment
and immaturity, bothersome habits and unhappiest
moments.


TABLE 1
Item Rankings and Correlation Coefficients


                                                      Rankings by t-value*                     Pearson r
Item†                                             Total N       M           F          Total N       M           F
 1. Views on marriage roles (L)                      5          3          10            .70        .66      [illegible]
 2. Depression, anxiety and anger (M)               17         11          17            .68        .65         .72
 3. Most regretted acts and why (H)                  4          5           6            .71        .63         .78
 4. Religious views and participation (L)       [illegible]    19           9            .67        .56         .77
 5. Feelings of maladjustment and
    immaturity (M)                                   3          6           2        [illegible]    .65         .16
 6. Guiltiest secrets (H)                           13          4          20            .61        .62         .62
 7. Views on politics (L)                       [illegible] [illegible] [illegible]  [illegible] [illegible] [illegible]
 8. Bothersome habits and reactions (M)              2          2           3        [illegible]    .73         .80
 9. Dissatisfaction with opposite sex (H)           12         18           7            .69        .59         .79
10. Erotic play and sexual lovemaking (H)           14          8      [illegible]       .62        .62         .63
11. Hobbies and leisure time (L)                    20         20          18            .68        .55      [illegible]
12. Happiest occasions in life (M)              [illegible] [illegible] [illegible]  [illegible] [illegible] [illegible]
13. Aspects of daily work (L)                       19         16          16            .69     [illegible] [illegible]
14. Positive personal characteristics (M)           16      [illegible] [illegible]      .69     [illegible] [illegible]
15. Persons most resented (H)                        8         15      [illegible]   [illegible] [illegible] [illegible]
16. Sexual intimacies (H)                           10         12          12        [illegible] [illegible] [illegible]
17. Unhappiest moments (M)                           1          1           1        [illegible] [illegible] [illegible]
18. Music preferences and dislikes (L)               9         15           8        [illegible] [illegible] [illegible]
19. Personal goals (L)                              18      [illegible] [illegible]  [illegible] [illegible] [illegible]
20. Personal depression and hurt feelings (M)        6          9           5        [illegible] [illegible] [illegible]
21. Sexual fantasies and reveries (H)               15         10      [illegible]   [illegible]    .62         .62


Note. The 21 items are abbreviated forms of those used in the SDQ published in Jourard's
Self-Disclosure, p. 215.
* For all t values, p < .05.
† Parenthetical labels refer to intimacy ratings. H denotes high intimacy; M, moderate intimacy;
and L, low intimacy (cf. Note above).
TABLE 2
Means and Standard Deviations

                              21 items                7-item intimacy groups
                       Total N     M       F           H       M       L
Mean total scores       45.97    46.43   45.42       14.02   15.27   16.68
Standard deviation      10.87     9.56   12.37        4.10    4.07    3.80


THE CLASSROOM BOUNDARY QUESTIONNAIRE:
AN INSTRUMENT TO MEASURE ONE ASPECT OF 
TEACHER LEADERSHIP IN THE CLASSROOM 


THOMAS L. MORRISON¹


Department of Psychiatry 
University of California, Davis 


Two studies were done to operationalize the concept of social
system boundaries as applied to teacher control in classrooms.
In Study 1 a multiple choice Classroom Boundary Questionnaire
(CBQ) was developed to measure teacher preference for boundary
control. The 25 items were found to load primarily on one factor
and to have adequate split-half reliability (corrected r = .85). In
Study 2, observations in 32 4th through 6th grade classrooms
found that observational measures of teacher boundary control
behavior could be reliably recorded and were correlated with
teachers' CBQ scores. Degree of boundary control preferred by
the teacher on CBQ and the frequency of child-initiated boundary
crossing events allowed in the classroom were negatively correlated
(r = -.48, p < .01).


MAINTAINING control over the behavior of children in their classrooms
has long been an important practical problem for teachers.
A major complaint of teachers about their preparation for teaching
is that the issues of control and motivation are not sufficiently
emphasized (Wright and Tuska, 1968). For many years, there was


1 This paper is based on sections of a dissertation submitted to the
Psychology Department of Yale University in partial fulfillment of the
requirements for the PhD Degree. The author wishes to express appreciation to
Dr. James C. Miller, chairman of the dissertation committee, and to Drs. 
Donald Quinlan and Claude Buxton, members of the committee. The research 
was supported by a Predoctoral Research Fellowship from the National 
Institute of Mental Health (5 FOL MH49563). 

The author's address is Department of Psychiatry, 
2315 Stockton Blvd., Sacramento, California 95817. 
a lack of research evidence on which to base statements about the
effects of teacher control (Ladd, 1958a, 1958b; Sheviakov and
Redl, 1956). Though the amount of research on classroom interaction
has increased in recent years, much of the social psychological
and educational research on teacher control in classrooms has remained
disjointed and without firm grounding in psychological
theory (Morrison, 1972).

Recently, writers have emphasized the possibilities inherent in
viewing the classroom as a group or social system, and the teacher
as the group leader or manager (Getzels and Thelen, 1960; Jensen,
1960; Jenkins, 1960; Roberts, 1971; Schmuck and Schmuck, 1971).
In this context, the question of teacher control becomes one aspect
of teacher leadership in the classroom group. This paper uses the
concept of boundary, developed in Tavistock social systems theory
(E. J. Miller, 1959; E. J. Miller and Rice, 1967), to conceptualize
questions relating to teacher control of child behavior in classrooms.


Study 1: Questionnaire Development 


The Concept of Classroom Boundaries 


In social systems theory, the concept of boundary is used to 
analyze interactions between and within groups. Boundaries, which 
can be physical but need not be, occur at points of discontinuity 
in space, time, or behavior. A discontinuity is a boundary if there is 
control or regulation of transactions across it (J. C. Miller, 1971).
It is an important function of the management of an organization
to regulate transactions across the boundaries between the organiza- 
tion and other social systems in its environment. The management 
must also regulate transactions across the boundaries that separate 
subsystems within the organization itself. All systems can be 
thought of as having a task that requires taking in materials from 
the environment, processing them, and distributing a product. In 
order for the task to be accomplished effectively, the management 
of the system must regulate the flow of the material as it passes 
across the boundaries from one processing system to the next. That 
is, the operations required for doing the task must be coordinated. 

The classroom can be thought of as a complex social system in
which each child is a subsystem and the teacher is the manager.
The task of the classroom group is the production of learning among
its members. As the manager of the classroom group, the teacher
has to decide how to control the transactions among group members
so that the task of learning is accomplished effectively and efficiently.
A classroom with high boundary control would be one in which
the transactions among students and between a student and the rest
of the class were carefully regulated by the teacher. In such a
classroom, for example, children would not talk to other children,
would not leave their seats to go to another part of the classroom,
and would not make a statement to the entire class unless the
teacher specifically initiated or approved such actions. In a classroom
with low boundary control, on the other hand, children would
be free to initiate conversations with other children and would have
free access to facilities in the room. With regard to the sequence of
work tasks, the teacher in a classroom with high boundary control
would specify in detail the task to be done and the time when it
was to be done. In a classroom where the time boundary was less
controlled, children would have more choice when to do their work.

Most elementary school classrooms have little direct interaction 
with the surrounding environment, so that most issues of boundary 
control relate to internal boundaries. However, there are some deci- 
sions that the teacher must make about the relationship between 
the classroom and the larger system of the school. In particular, the 
teacher must decide about the conditions under which children can 
leave the classroom: whether children can decide to leave the room 
themselves (low boundary control), whether they must first seek 
permission, or whether they may leave only at times indicated by 
the teacher (high boundary control). In general, then, the question 
of boundary control in classrooms relates to the degree of constraint 
on children with regard to their use of time and space in the class- 
room and with regard to the kinds of interactions they can have 
with other members of the class. This paper reports on the develop- 
ment of a questionnaire designed to operationalize the concept of 
boundary control in classrooms. 


Method


A multiple choice questionnaire, the Classroom Boundary Questionnaire
(CBQ), was constructed to measure the boundary conditions
that teachers say should prevail in their classrooms. Thirty
questions were written to present a brief boundary-related situation.
Following each situation was a list of three or four possible subsequent
courses of action. Respondents were asked to choose the
alternative that most nearly represented the behavior preferred in
their classrooms. The alternatives within each question varied along
a continuum that reflected the extent of boundary control exercised
by the teacher in the classroom. Items covered teacher preferences
for control over three kinds of boundaries: space boundaries, time
boundaries, and behavior boundaries. Brief descriptions of these
boundaries (and examples of questionnaire items) follow.²


Control of Space Boundaries 


A child's assigned seat can be thought of as a bounded space.
Questions relating to space boundaries asked teachers for their
opinions about conditions under which children ought to be allowed
to cross the boundary around their seats. A child can cross this
boundary by talking to another child or by leaving his seat. In
some classrooms, children can cross this boundary whenever they
wish; in others, they may do so only with the permission or at the
instruction of the teacher. The following item is about the boundary
around the child's seat:

After finishing his assigned seatwork,

(a) a child should feel free to leave his seat in order to get
materials from the classroom library or from another part
of the room.

(b) a child should be permitted to leave his seat only after
requesting the teacher for permission to do so.

(c) a child should stay in his seat until the teacher directs him to
some other activity.

Another space boundary is the boundary between the classroom 
and the rest of the school. Some questions asked about conditions 
under which children were allowed to cross this boundary. The 
following is an example: 

A child who wishes to leave the room during a class period 

(e.g., to go to the bathroom or get a drink of water) 

(a) should feel free to get up and leave quietly. 

(b) should be allowed to do so, but only after asking permission 
from the teacher. 


(c) should be told to wait until recess or another scheduled break 
period. 


Control of Time Boundaries 


The classroom day is bounded at the beginning and the end by 
clear times for the start of school and for dismissal. Within those 
limits teachers can vary in how much freedom of choice they allow 
children in their use of class time. Questions about time boundaries 


2 Copies of the Classroom Boundary Questionnaire and information about 
scoring are available on request from the author. 
attempted to ascertain whether teachers allowed children some
latitude in choosing when to do work, or whether they structured 
the children’s use of time more closely. The following question is an 
example: 

With regard to seatwork assignments 

(a) a child should feel free to work on his assignments in what- 
ever order he wishes, provided he gets all his work done by 
the end of the day. 

(b) when a particular seatwork lesson (e.g., Arithmetic or Spell- 
ing) is assigned, the child should work only on that lesson 
until the teacher directs the class to a new lesson. 

(c) a child should not have to turn in routine assigned work on 
time if he is working on something else that is interesting 
to him. 


Control of Behavior Boundaries 


Some items were written to determine how broad a range of 
behavior was acceptable in the classroom, or how clear the boundary 
was between acceptable and unacceptable behavior. Teachers were 
asked what kinds of physical and verbal behaviors were “within 
bounds.” The following question is an example: 

With regard to language in class,

(a) children should feel free to use the language that comes
naturally to them, even if it is ungrammatical (i.e., colloquialisms,
slang, "ain't," etc.). Swearing, however, should
not be allowed.

(b) children should feel free even to swear in class if this expresses
how they feel.

(c) children should watch their language in the classroom: they
should be polite and try to use correct English.


Judges’ Ratings 


Establishing a scoring system for the questionnaire involved
verifying that the alternative choices within each question did vary
along a continuum reflecting teacher control of the boundaries. Each
alternative within each of the 30 questions was rated on a 9-point
scale by 26 psychologists and social workers, producing 96 ratings
for each rater. The end points of the scale used by the raters were
labelled (a) Not permissive vs. Very permissive; (b) Not much
freedom allowed vs. Very much freedom allowed; and (c) Very
much structure imposed vs. Not much structure imposed. The raters
were instructed to assign a different value to each alternative
within each question.

The raters showed high agreement. For each of the 96 alternatives,
a mean rating was determined, and the analysis of variance
procedure suggested by Winer (1962, pp. 124-132) was used to
estimate the reliability of each mean rating. The resulting reliability
figure was r = .99. One question that showed particular disagreement
among raters was dropped at this time.


Subjects 


Sixty-two graduate students of education served as Ss. Of these, 
17 were in three sections of a practicum course for teachers study- 
ing to be guidance counsellors and 45 were in two sections of a 
graduate course in child development. Eighteen were men, the mean 
age was 27.4 years (SD = 5.6), and the mean number of years
teaching experience was 2.9 (SD = 2.3). All but six of the Ss had
had actual teaching experience beyond student teaching, and all 
but three of the rest were employed as teachers at the time they 
answered the questionnaires. 


Questionnaire Administration 


Class time was made available for E to describe his research and
for class members to complete several questionnaires. E briefly
discussed the lack of research on classroom discipline, and described
his project as a study of “how teachers’ attitudes and opinions
relate to their decisions about what kinds of things should be allowed
to happen in classrooms.” The teachers were then asked to complete
four questionnaires, three of which³ were characterized as “ways of
measuring attitudes and opinions.” They were told that the fourth
questionnaire (the CBQ) was developed by E “to find out what
teachers think should happen in classrooms.” CBQ items were
designed to refer to 4th, 5th, and 6th grade classrooms. Teachers who
did not normally teach those grades were asked to respond in terms
of what they would expect in their classrooms if they did teach at
that level. Four persons in the counselling practica declined to
participate.


³ The personality questionnaires were the Marlowe-Crowne Social
Desirability Scale (Crowne and Marlowe, 1964), the Tomkins Left-Right Scale
(Tomkins, 1963), and the Miller Boundary Questionnaire (Miller, 1968).


Scoring System


In scoring the questionnaire the mean ratings that had been 
derived for each alternative within each question were transformed 
into rank orderings. Thus, each alternative received a value ranging 
from 1 to 3 or 4 (depending on the number of alternatives within 
the question). The higher the score, the more control was represented
by the item. A total boundary control score was assigned
to each S by summing the value for each alternative chosen.

There were four questions on which 75% or more of the teachers 
chose the same alternative. These questions were dropped, leaving 
25 questions contributing to the score. The possible range of CBQ 
scores extended from 25 to 81. 
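
The scoring rule can be sketched as follows. The mean ratings shown are hypothetical, and the sketch assumes that a higher mean rating means a more permissive alternative, so that the least permissive alternative receives the highest control rank. The split into 19 three-alternative and 6 four-alternative questions is an inference, not a figure given in the text: 96 alternatives over 30 questions implies 24 and 6, and the reported range of 25 to 81 is reproduced if the five dropped questions each had three alternatives.

```python
def rank_alternatives(mean_permissiveness):
    """Convert mean judge ratings into control ranks (1 = least control)."""
    # Assumption: higher mean rating = more permissive, so the most
    # permissive alternative is ranked 1 and the least permissive is
    # given the highest rank.
    ordered = sorted(
        mean_permissiveness, key=mean_permissiveness.get, reverse=True
    )
    return {alt: rank for rank, alt in enumerate(ordered, start=1)}


def cbq_score(choices, ranks_by_question):
    """Sum the control rank of the alternative chosen on each question."""
    return sum(ranks_by_question[q][alt] for q, alt in choices.items())


# Hypothetical mean ratings for the sample item on language in class.
language_item = rank_alternatives({"a": 6.1, "b": 8.4, "c": 2.2})

# Range check under the inferred split of question types.
n_three, n_four = 19, 6
min_score = n_three + n_four           # every answer ranked 1
max_score = 3 * n_three + 4 * n_four   # every answer given the top rank
```

A teacher choosing alternative (c) on the sample item would add 3 to the total, and one choosing (b) would add 1.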


Results 


The CBQ produced a range of scores from 29 to 62, with a mean
of 45.9 (SD = 8.8). With items assigned randomly to halves, the
CBQ had a split-half reliability, corrected by the Spearman-Brown
formula, of r = .85. No test-retest reliability figure is as yet
available.
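
A minimal sketch of a random split-half estimate with the Spearman-Brown correction follows. The function names, the random seed, and the toy data are hypothetical; only the correction formula, 2r / (1 + r) for a test of doubled length, is standard.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half):
    """Step a half-test correlation up to full-test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores, seed=0):
    """item_scores[s][i] is the score of subject s on item i."""
    n_items = len(item_scores[0])
    items = list(range(n_items))
    random.Random(seed).shuffle(items)   # assign items randomly to halves
    half_a, half_b = items[: n_items // 2], items[n_items // 2:]
    totals_a = [sum(row[i] for i in half_a) for row in item_scores]
    totals_b = [sum(row[i] for i in half_b) for row in item_scores]
    return spearman_brown(pearson(totals_a, totals_b))


# Working backwards: a corrected value of .85 corresponds to a correlation
# of about .74 between the two half-test totals.
r_half_implied = 0.85 / (2 - 0.85)
```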

CBQ score did not correlate with the teacher’s age or amount of
experience as a teacher. The mean CBQ scores of male and female
teachers were not significantly different. The teachers in the
counselling practica had a lower mean CBQ score than the rest of the
teachers (t = 3.67, p < .001).

Though no formal assessment of Ss’ reactions to the questionnaire
was attempted, it appeared that most teachers enjoyed the CBQ
and found it relevant and realistic. Some complained that the
multiple choice format kept them from saying how they really acted in
the situation, or that it did not allow for the richness, complexity,
and multiple contingencies of real classroom interaction.


Factor Analysis

The items of the CBQ were subjected to a factor analysis. In the
unrotated factor matrix, all but four items loaded on the first factor
[...] for 26.2% of the variance. [...] and 7.5% respectively, [...]
after that. A varimax rotation [...] room.


Study 2: Questionnaire Validation


Validation of the Classroom Boundary Questionnaire involved 
the development of observational procedures to measure actual 
teacher boundary control behavior in classrooms. 

With regard to space boundaries, at least four situations relating
to the boundary between a child and the surrounding social system
can be identified: a child leaving his seat, a child having a
conversation with another child or several children, a child talking
aloud to the entire class, and a child leaving the classroom. There
are three possible ways for such boundary-crossing events to be
initiated. The teacher can instruct the child to cross the relevant
boundary (e.g., asking the child to get up and collect papers), the
child can seek permission from the teacher before crossing the
boundary (this is usually done by the child raising his hand), or
the child can take the initiative and cross the boundary without
referring first to the teacher. In the last case, the teacher can
either reassert control of the boundary by correcting the child's
behavior (e.g., a simple reminder not to talk, a reprimand, or
punishment), or the teacher can implicitly allow the boundary
crossing by not responding.

Thus, important information about boundary control in the 
classroom would be contained in a record of the following kinds 
of boundary-related events: those resulting from a teacher instruc- 
tion, from teacher permission, from a child’s initiative but con- 
trolled by the teacher, and from a child’s initiative and implicitly 
allowed by the teacher. A high-boundary classroom would be one 
in which many boundary crossings resulted from teacher instruction, 
or one in which few boundary crossings resulted from child initiative. 

Another aspect of boundary control discussed previously, and
included in the items of the Classroom Boundary Questionnaire,
referred not so much to the boundary between each child and the
surrounding social system but rather to the boundary around the
task requirements. An observational estimate of the constraints
imposed on children in doing their work would be contained in a
record of the frequency with which the teacher gave specific
directions about how to do the work, checked on the progress of work
being done, or referred to time limitations.

The task of this study was to develop observational measures of
teacher boundary control, to test their reliability and stability, and
to investigate their relationship to teachers’ scores on the
Classroom Boundary Questionnaire.


Method


Observations were done in 32 classrooms in the five schools of a
single suburban elementary school system with a predominately
white middle class and upper-middle class pupil population. There
were 10 fourth grade, 11 fifth grade, and 11 sixth grade classrooms
with a median class size of 21 (range 18-25). The distribution of
classrooms by school was: 7 in school 1, then 9, 6, 5, 5 in schools 2
through 5. The eight men and 24 women teachers had a mean age of
34.5 (SD = 11.0) and a mean number of years teaching experience
of 8.8 (SD = 7.1). The teachers volunteered to participate in the
study after the investigator described the aims and procedures at
faculty meetings at each of the schools.⁴


Observational Procedures 

As part of a larger study on teacher-pupil interactions in the
classroom (Morrison, 1972), each classroom was observed for four
half-hour periods between January and March by a single research
assistant experienced in observing groups. She had been carefully
trained in using the observational categories but was not aware
of the specific hypotheses of the study. To establish reliability, the
investigator observed along with the assistant for four sessions in
each of three of the classrooms at the beginning of the study.
Within constraints of scheduling, the observations were spaced
over the 3-month period, and teachers were informed in advance
of the schedule. Each classroom was observed during periods of
instruction in several subject areas.

During each classroom visit, there were two four-minute periods
during which O observed the classroom as a whole and noted each
instance of four boundary-related behaviors: when children (a)
left their seats, (b) talked with other children, (c) talked to the
entire class, and (d) left the room. Frequency counts of these
events are referred to as the classroom movement variables.

For each instance recorded, O also noted how the behavior was
initiated: (a) by teacher instruction, (b) by teacher permission, or
(c) by the child’s own initiative. If the behavior was initiated by the
child, the observer watched for the teacher’s reaction and noted
whether the teacher (a) tried to stop the behavior or (b) implicitly
allowed the behavior by not responding.⁵ In most cases it was easy
to record boundary crossing behavior with this system. It was more
complicated when class projects required many children to be out
of their seats. In such cases O first counted the number of children
out of seat and then scanned the room looking for conversations
among children.


⁴ Of a total of 45 teachers in the school system, [...] actually
volunteered; but three sixth grade teachers at one school were omitted
from the sample because their team teaching methods differed from the
procedures of other teachers in the system.

From these tallies four measures of boundary control were de- 
rived for each teacher for each period of observation: (1) Teacher 
Instruction was the sum of teacher-initiated events; (2) Teacher 
Permission was the sum of events for which the teacher gave per- 
mission before they happened; (3) Teacher Control was the number 
of times the teacher reprimanded a child or told a child or the 
class to stop a behavior; (4) Child Initiative allowed was the total 
number of events initiated by children in the classroom that the 
teacher did not try to stop. 
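
The derivation of the four measures from the observer's tallies can be sketched as below. The event coding (a tuple of behavior, mode of initiation, and whether the teacher tried to stop it) and all names are hypothetical; in particular, Teacher Control is approximated here by the number of child-initiated events the teacher tried to stop, which may not capture every reprimand the observer counted.

```python
from collections import Counter


def boundary_measures(events):
    """Derive the four boundary-control measures for one observation period.

    Each event is (behavior, initiation, stopped), where initiation is
    "instruction", "permission", or "child", and stopped records whether
    the teacher tried to stop a child-initiated event.
    """
    measures = {
        "teacher_instruction": 0,
        "teacher_permission": 0,
        "teacher_control": 0,
        "child_initiative_allowed": 0,
    }
    for _behavior, initiation, stopped in events:
        if initiation == "instruction":
            measures["teacher_instruction"] += 1
        elif initiation == "permission":
            measures["teacher_permission"] += 1
        elif stopped:
            measures["teacher_control"] += 1
        else:
            measures["child_initiative_allowed"] += 1
    return measures


def movement_counts(events):
    """Frequency of each boundary-related behavior, however it was initiated."""
    return Counter(behavior for behavior, _initiation, _stopped in events)


# Hypothetical tallies from one four-minute period.
period = [
    ("out_of_seat", "instruction", False),
    ("out_of_seat", "child", False),
    ("talk_to_child", "child", True),
    ("talk_to_child", "child", False),
    ("talk_to_class", "permission", False),
    ("leave_room", "permission", False),
]
measures = boundary_measures(period)
movement = movement_counts(period)
```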

During the four-minute observation period, O also had the task 
of recording teacher comments that reflected attempts to control 
the boundary around the task. Three types of teacher statements
were included: (1) specific directions about assigned work (state- 
ments in the imperative mode about work at hand); (2) directions 
about specific procedures (references to the proper or approved way 
of doing work); (3) references to the progress of work or to time 
limits. The variable called Task-Related Directions was the total 
number of statements in these three categories. This variable mea- 
sured aspects of teacher control over time boundaries and behavior 
boundaries in the classroom. 

The five variables described above, teacher instruction, teacher 
permission, teacher control, child initiative allowed, and task-related 
directions, are referred to as the teacher behavior variables. 


Questionnaire Administration 


After all observations had been completed, the investigator
administered a set of questionnaires in each classroom. These included
the My Class Inventory (Anderson, 1971), with several items added
to determine the children’s perception of boundaries in the
classroom (sample: “It is all right to get up and walk around in class”).
A mean score on this scale was computed for each classroom. The
classroom teacher completed her own set of questionnaires at the
back of the room while the investigator administered questionnaires
to the children. The teachers identified themselves by name on the
questionnaires, having been assured of complete confidentiality. The
children did not indicate their names.


⁵ A list of examples of incidents that would be included under each category
is available on request.


Results 


Reliability and Stability of Observational Measures 


Agreement between observers on the observational measures was
assessed by the correlation coefficients between the investigator's and
the observer's ratings over the 12 classroom sessions they observed
together. They observed four sessions in each of three classrooms
at the beginning of the study. The correlation coefficients for teacher
behavior variables and classroom movement variables shown in
Table 1 confirm that the investigator and the observer agreed on
the categorization of the various behaviors.

The stability of each of the behavior categories over time was
assessed by correlating the mean of the observer's ratings for the
first and third observation periods with the mean of her ratings
for the second and fourth observation periods. These coefficients for
stability, corrected by the Spearman-Brown formula and shown in
Table 1, are considerably lower than those for observer agreement.
A repeated measures analysis of variance over the four observational
periods showed no main effects for time period. After these
analyses, a mean for each variable over the four time periods was
computed. These means were the data used in subsequent analyses.


TABLE 1
Observer Agreement and Stability for Observational Variables

                                          Observer
                                          Agreement(a)   Stability(b)

Teacher Behavior Variables
  Teacher Instruction                     [illegible]    [illegible]
  Teacher Permission                      [illegible]    .10
  Teacher Control                         [illegible]    .55**
  Child Initiative                        [illegible]    [illegible]
  Task-related Directions                 .95            [illegible]
Classroom Movement Variables
  Children Out of Seat                    [illegible]    .57**
  Children Talking with Other Children    .88            [illegible]
  Children Talking Out                    [illegible]    .22
  Children Leaving Room                   .96**          [illegible]

(a) Observer agreement was assessed by the Pearson product-moment
correlation between the investigator's and the observer's ratings over
12 class sessions that both observed.
(b) Stability of the observer's ratings was assessed by the correlation
between the mean of the observer's first and third ratings of a classroom
and the mean of her second and fourth ratings, over all 32 classrooms.
This correlation was adjusted by the Spearman-Brown formula.
* p < .05.
** p < .01.



Relationships among Observational Variables 


The only significant correlation among the observational mea- 
sures of teacher behavior was between Child Initiative allowed and 
Teacher Control (r = .69, p < .01). The relationships among the 
variables were explored by a factor analysis (Table 2) which 
revealed two factors with latent roots greater than 1.0 that together 
accounted for approximately 60% of the variance of the teacher 
behavior variables. A varimax rotation of the first two factors did 
not change the pattern of factor loadings found in the principal 
components solution. The first factor had three variables with high
positive loadings: teacher permission, teacher control, and child 
initiative. The other two teacher behavior variables, teacher instruc- 
tion and task-related directions, had high positive loadings on the 
second factor. The factor loadings, and especially the high positive 
correlation between the number of control attempts made by the 
teacher and the number of child-initiated events allowed by the 
teacher, suggest that an important underlying dimension in these 
classrooms was the amount of child activity. 
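
The "approximately 60%" figure can be checked by arithmetic. In a principal components analysis of a correlation matrix the latent roots sum to the number of variables, so the two retained roots must account for whatever the three smaller roots do not. The sketch below takes the three smaller roots as printed in Table 2 (0.96, 0.73, 0.28); the variable names are hypothetical, and the retention rule shown (roots greater than 1.0) is the one stated in the text.

```python
n_variables = 5                     # the five teacher behavior variables
smaller_roots = [0.96, 0.73, 0.28]  # third to fifth latent roots in Table 2

# Latent roots of a correlation matrix sum to the number of variables,
# so the first two roots together must equal the remainder.
retained_sum = n_variables - sum(smaller_roots)
proportion_explained = retained_sum / n_variables


def retained(roots, threshold=1.0):
    """Keep only the roots above the threshold, as described in the text."""
    return [r for r in roots if r > threshold]
```

The remainder is about 3.03, or roughly 61% of the total variance, which agrees with the figure reported for the first two factors.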


Relationships with the Classroom Boundary Questionnaire 


Teachers’ scores on the Classroom Boundary Questionnaire were
related in the expected way to the amount of movement in their
classrooms (Table 3). Score on CBQ, i.e., the amount of boundary
control teachers said they preferred, was negatively correlated with
the number of times children left their seats (r = -.40, p < .05)
and with the amount of child-child talk (r = -.44, p < .05). The
negative correlations with talking out and leaving the room were
not significant but were in the expected direction. Further, the
correlation between CBQ score and the questionnaire about boundary
control that was administered to the children in each classroom
was r = .59 (p < .01). That is, children in the classrooms reported
experiencing the boundaries that the teachers said they maintained.

The correlations of the teacher behavior variables with CBQ are
also shown in Table 3. CBQ score was negatively related to the
amount of child-initiated boundary crossing that was allowed by
the teacher (r = -.48, p < .01). A stepwise multiple regression
analysis showed that the multiple correlation between CBQ and
the teacher behavior variables increased only to R = .53 when
teacher permission and teacher instruction were included with child
initiative as predictor variables.


TABLE 2
Principal Components Factor Analysis of Teacher Behavior Variables

                                         Factors
                            1        2        3        4        5
Teacher Instruction      [illeg.] [illeg.] [illeg.] [illeg.]   -.04
Teacher Permission       [illeg.] [illeg.] [illeg.] [illeg.] [illeg.]
Teacher Control            .88      .19      .05     -.24     -.37
Child Initiative         [illeg.] [illeg.] [illeg.] [illeg.]    .36
Task-related Directions    .14      .83      .44      .30      .10
Latent Root               1.4      1.00     0.96     0.73     0.28


TABLE 3
Correlations Suggesting the Validity of the Classroom Boundary Questionnaire (CBQ)
N = 32

                                  CBQ
Child Movement Variables
  Out of Seat                    -.40*
  Talking with Classmates        -.44*
  Talking Out                    -.23
  Leaving Room                   -.29
Teaching Behavior Variables
  Teacher Instruction            -.13
  Teacher Permission             -.31
  Teacher Control                -.30
  Child Initiative               -.48**
  Task-related Directions        -.01

* p < .05.
** p < .01.
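
The significance levels attached to these correlations can be checked with the usual t transformation of r, t = r * sqrt(N - 2) / sqrt(1 - r^2), with N - 2 = 30 degrees of freedom. The sketch below assumes two-tailed tests, which the text does not state; the critical values are the standard tabled ones for 30 degrees of freedom, and the function names are hypothetical.

```python
from math import sqrt

# Two-tailed critical values of t for 30 degrees of freedom (N = 32).
T_CRITICAL_DF30 = {0.05: 2.042, 0.01: 2.750}


def t_for_r(r, n):
    """t statistic for testing a correlation against zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def significance_n32(r):
    """Classify a correlation based on 32 classrooms (assumes two tails)."""
    t = abs(t_for_r(r, 32))
    if t >= T_CRITICAL_DF30[0.01]:
        return "p < .01"
    if t >= T_CRITICAL_DF30[0.05]:
        return "p < .05"
    return "n.s."


# Correlations of CBQ score with the observational variables.
levels = {r: significance_n32(r) for r in (-0.40, -0.44, -0.48, -0.23, -0.29)}
```

Under these assumptions the correlations of -.40 and -.44 reach the .05 level, -.48 reaches the .01 level, and -.23 and -.29 fall short of significance.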


Discussion 

This study suggests that the Classroom Boundary Questionnaire
is a useful instrument for measuring teachers’ preferences about
boundary control in the classroom. The judges’ consistent ratings
in Study 1 suggested that the concept of boundary control was a
clearly definable one; the factor analysis suggested that it was a
unidimensional one. Study 2 showed that measures of teacher
behavior derived from the concept of boundary control could be
recorded reliably and were moderately stable across periods of
observation. Teachers’ scores on the CBQ were found to be in accord
with the behavior observed in their classrooms: teachers who
reported a preference for more control over the boundaries had
classrooms in which there was in fact less movement. Most important,
there was a significant negative correlation between the degree of
boundary control preferred by the teacher and the frequency of
child-initiated boundary crossing events allowed.

Research relating to teacher control in classrooms began
years ago. Lewin and his colleagues studied the effects of
authoritarian, democratic, and laissez-faire adult leadership on child
behavior in small activity groups (Lewin, Lippitt, and White, 1939;
White and Lippitt, 1968). Anderson used similar concepts of
dominative vs. integrative leadership in his studies of classroom
behavior (Anderson and Brewer, 1945; Anderson and Brewer, 1946;
Anderson, Brewer, and Reed, 1946).

This early research was imaginative, and since then the concept
of authoritarian or controlling behavior has been intuitively
compelling to researchers. However, research applying these concepts to
classroom interaction has not been fruitful (R. Anderson, 1959).
The difficulties have been twofold: the research has not been based
on psychological theory, and the concept of control has been
confounded (Morrison, 1972; Smith and Hudgins, 1967; Wallen and
Travers, 1963).

In most studies, controlling teachers have been defined as those
who (a) set distinct limits that restrict the child's freedom of action
and (b) respond to the breaking of limits in a punitive manner.
The assumption has been that limit-setting and a tendency to be
cold or punitive are highly correlated. Some data exist to suggest
that this is not true (Christensen, 1960; Wright and Sherman, 1965).
The confusion of these two aspects of teacher control has led to
conflicting predictions about the effects of teacher control. In two
recent studies one author predicted that higher teacher control
would be associated with more stress and consequently less
achievement in the classroom (Soar, 1967). Another author predicted that
permissiveness would be negatively related to achievement
(Christensen, 1960).


The concept of boundary control is derived from a theory of
social system functioning (Miller and Rice, 1967) and helps
clarify what is meant by teacher control in the classroom. As
leader of the classroom group, the teacher is responsible for creating
conditions that allow the group to do its work. This means that
the teacher must make decisions about how to control the
interactions among the children in the group in such a way as to
facilitate learning. This concept of control is not confounded. It clearly
refers to the setting of limits, not to the degree of punitive behavior
by the teacher. Thus, it is hoped that the measures of boundary
control presented in this paper will be useful instruments in
clarifying the study of teacher control in the classroom.


AN ASSESSMENT OF THE EFFECTIVENESS OF
COMPLEX ALTERNATIVES IN MULTIPLE
CHOICE ACHIEVEMENT TEST ITEMS


DANIEL J. MUELLER
Indiana University


Discrimination indices and difficulty levels of multiple-choice
achievement test items containing only substantive response
alternatives were compared with items containing each of three
complex alternative types: All of the above; None of the above;
and combination complex alternatives (e.g., A and B; Either
A or B; Two of the above, etc.). The same item statistics were
compared, with items reclassified according to type of alternative
keyed as the correct answer. Items containing combination
complex alternatives were found to be most difficult, and items
containing an “All of the above” alternative were found to be
least difficult (especially when that alternative was keyed as the
correct answer). Discrimination index was less affected by the
inclusion or exclusion of complex alternatives than was difficulty
level, but the highest discrimination occurred in items containing
only substantive alternatives; the lowest in items in which “None
of the above” was keyed as the correct answer. It was also found
that all three types of complex alternatives functioned better as
distractors than did substantive alternatives, with combination
complex distractors receiving the highest rate of response.


While there appears to be a high degree of agreement among
test construction experts regarding the nature and importance of
item writing principles, the empirical evidence supporting the
validity of most of these principles is sketchy or nonexistent. This
study will examine the effectiveness of three types of complex
alternatives in multiple choice items: None of the above; All of
the above; and “combination” alternatives (such as A and B, A
and C, Either A or B, Two of the above, and the like).

Copyright © 1975 by Frederic Kuder

Studies of the effectiveness of complex alternatives have had
mixed findings. Boynton (1950) concluded that use of the “None
of these” alternative made items more difficult and better
discriminating. Wesman and Bennett (1946) found the “None of these”
alternative to be very effective in certain items, but not in all
items. Rimland (1960) found no advantage in using the “Right
answer not given” alternative. Williamson and Hopkins (1967)
found that incorporating the “None of these” alternative in test
items had no effect on test reliability or validity, but made items
more difficult. And Hughes and Trimble (1965) concluded that
complex alternatives can increase item difficulty (especially
combination complex alternatives) but have little or no effect on item
discrimination.


Method 

The present study compares discrimination indices and difficulty 
levels of items containing only substantive alternatives with those 
of items containing, respectively, each of the complex alternative 
types described above. A second series of comparisons follows the 
same pattern, but rather than grouping items according to the 
presence of a particular alternative in an item, items are classified 
according to the type of alternative which is keyed as the correct 
answer. Thus, discrimination indices and difficulty levels of all items
in which a substantive alternative is the correct answer are
compared with the same item statistics of all items in which “All of
the above” is the correct answer, etc. Lastly, the usefulness of each
of the three complex alternative types as distractors is examined.

Item statistics are from six unit examinations administered to
students enrolled in the Indiana Approved Real Estate Salesmen’s 
Course. Successful completion of this course is required by law 
before prospective real estate salesmen and saleswomen in Indiana 
can take the Real Estate Salesmen’s License Examination. The 
course is offered three times a year at 28 locations throughout the 
state. Enrollment varies from around 300 in the Summer to between 
700 and 1000 during the Fall and Spring semesters. The examinations 
utilized in this study are from four recent terms. Each examination 
contains one hundred items, more than half of which are multiple- 
choice. Each multiple-choice item has five alternatives. All ex- 
aminations have KR-20 internal consistency coefficients between 


.84 and .86. 
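
The item statistics used in this study can be illustrated with a short sketch. Difficulty level is the proportion of examinees answering the item correctly, so a higher p means an easier item. The discrimination index is assumed here to be the point-biserial correlation between the item and the total score; the text reports it as an r but does not name the coefficient. KR-20 is computed with its standard formula. The function names and the response matrix are hypothetical.

```python
from math import sqrt
from statistics import mean, pvariance


def difficulty(item):
    """Proportion of examinees answering the item correctly."""
    return mean(item)


def discrimination(item, totals):
    """Point-biserial r between a 0/1 item and total score (assumed index)."""
    mi, mt = mean(item), mean(totals)
    cov = sum((i - mi) * (t - mt) for i, t in zip(item, totals))
    var_i = sum((i - mi) ** 2 for i in item)
    var_t = sum((t - mt) ** 2 for t in totals)
    return cov / sqrt(var_i * var_t)


def kr20(responses):
    """Kuder-Richardson formula 20 for a matrix of 0/1 item responses."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    p = [mean(row[i] for row in responses) for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    return k / (k - 1) * (1 - sum_pq / pvariance(totals))


# Hypothetical responses: 5 examinees (rows) by 4 items (columns).
responses = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
totals = [sum(row) for row in responses]
first_item = [row[0] for row in responses]
p_first = difficulty(first_item)
r_first = discrimination(first_item, totals)
reliability = kr20(responses)
```

In this sketch the total score includes the item being evaluated, which inflates the index slightly on short tests; with 100-item examinations the effect would be small.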

Results 
Table 1 shows mean difficulty level and mean discrimination
index, by test and across tests, for items containing only substantive
alternatives, and for items containing, respectively, each of the
three complex alternative types. On all but one test the highest
mean discrimination index occurred in items containing only
substantive alternatives. Items containing each of the three types of
complex alternatives discriminated, on the average, about as well
as one another. Differences in mean difficulty level were more
extreme. Least difficult, on the average, were items containing only
substantive alternatives (p = .79) and items containing the “All
of the above” alternative (p = .78). Somewhat more difficult were
items containing the “None of the above” alternative (p = .74).
By far the most difficult were items containing combination complex
responses (p = .64).

Item statistics in Table 2 are from items classified according to 
the type of response alternative which was keyed as the correct 
answer. While there was not a great deal of variance in discrimina- 
tion indices, it appears that items in which a substantive response 
was the correct answer discriminated better, on the average, than 
did items in which any of the complex alternative types was the 
correct answer. Items in which “All of the above” was the correct 
answer were clearly easier, on the average, (p = .82) than were 
items in which any other alternative type was the correct answer. 
Most difficult were items in which a combination complex alterna- 
tive was the correct answer (p = .64). Items with a substantive 
alternative as the correct answer and items with “None of the 
above” as the correct answer were intermediate in difficulty (p = .77 
and .76 respectively). 

Table 3 indicates the usefulness of each of the alternative types 
as distractors. On the average, each time a combination complex
response was not the correct answer .13 of the students selected it, 
compared with .10 for “None of the above,” .07 for “All of the 
above,” and .05 for all substantive wrong alternatives. 


Discussion and Conclusions 


Clearly, in this study, the inclusion or exclusion of complex
alternatives had a marked effect on item difficulty, with items containing
combination complex alternatives being the most difficult (whether
or not the combination alternative was the correct answer), and
items containing an “All of the above” alternative being the easiest
(especially when that alternative was keyed as the correct answer).
Discrimination indices were less affected by the inclusion or
exclusion of complex alternatives. The highest mean discrimination index
(r = .30) occurred in items containing only substantive responses.
The lowest mean discrimination index (r = .25) occurred in items
in which “None of the above” was keyed as the correct answer.

Two qualifications are in order in generalizing from these find- 
ings. Examination of the proportionate use of complex alternatives 
as the correct answer relative to the inclusion of these alternatives 
(as right or wrong answers) in test items indicates that the “All 
of the above” alternative and the various forms of combination 
complex alternatives were overused as correct answers. In fact, “All 
of the above” was keyed as the correct answer 51% of the times 
it appeared in items, and combination alternatives were keyed as 
the correct answer 42% of the times they appeared in items. “None 
of the above” was seriously underused as a correct answer, being 
keyed as the correct alternative only 10 out of the 94 times it 
appeared in test items. This disproportionate use of complex alter- 
natives as correct answers may have seriously affected item diffi- 
culty and discrimination. It is quite likely that if the “All of the 
above” alternative and the various forms of combination alterna- 
tives had been used more often as distractors, and if the “None of 
the above” alternative had been used more often as the correct 
answer, items utilizing these three forms of complex alternatives 
would have been more difficult and better discriminating. 
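
The extent of the imbalance can be made concrete with a little arithmetic. The benchmark used below, one chance in five for a five-alternative item, is an assumption introduced for illustration and is not a standard stated in the text; the counts and percentages are those reported above, and the variable names are hypothetical.

```python
# Rate at which each complex alternative was keyed as correct when present.
none_keyed, none_appearances = 10, 94
keyed_rate = {
    "All of the above": 0.51,
    "Combination": 0.42,
    "None of the above": none_keyed / none_appearances,
}

# Assumed benchmark: with five alternatives per item, an alternative keyed
# no more often than any other would be correct about 1 time in 5.
benchmark = 1 / 5

relative_use = {name: rate / benchmark for name, rate in keyed_rate.items()}
```

Against that benchmark, "All of the above" was keyed roughly two and a half times as often as an evenly used alternative, and "None of the above" only about half as often.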

The second qualification affects only the combination complex
alternatives. This category contained several discrete kinds of
alternatives. Consequently it is impossible to determine from this study
the differential effectiveness of the various kinds of combination
complex alternatives.


THE ANALYSIS OF MULTIVARIATE GROUP
DIFFERENCES


ALAN L. GROSS 
The City University of New York 


A computer program for evaluating the differences between
k ≥ 2 groups on p ≥ 1 dependent variables is described. The
statistical rationale for this program is based upon the Roy Union
Intersection approach. A useful feature of the program is the
computation of simultaneous confidence intervals for comparing the
groups on each dependent variable and each discriminant function.


A common research problem encountered in the social sciences is
that of identifying a set of p ≥ 1 variables that will discriminate
among a set of k ≥ 2 groups. For example, a school counselor might
wish to determine whether a set of biographical measures
discriminates between student drop outs and non-drop outs. In a controlled
experiment, one may ask whether k ≥ 2 experimental groups differ
significantly from each other on p ≥ 1 dependent variables. The
purpose of this paper is to describe the MANOVA Computer
Program for ascertaining whether k groups differ significantly from one
another on p dependent variables.


Rationale for the MANOVA Program


A powerful technique for studying these multivariate discrimination
problems is the Roy Union Intersection approach (Morrison,
1967). The null hypothesis to be tested is that the p by 1 mean
vectors of k multivariate normal distributions are all equal.

$$H_0: \boldsymbol{\mu}_1 = \boldsymbol{\mu}_2 = \cdots = \boldsymbol{\mu}_k \qquad (1)$$

An equivalent statement of $H_0$ is that for any linear combination
of the p dependent variables, every comparison among the groups
has a value of zero. More specifically, letting $D = \sum_i a_i x_i$ denote
some arbitrary linear composite of the original dependent variables
$(x_1, x_2, \ldots, x_p)$, and $c_1, c_2, \ldots, c_k$ a set of comparison weights
($\sum_j c_j = 0$), the null hypothesis can be stated as

$$H_0: \sum_j c_j \sum_i a_i \mu_{ij} = 0 \quad \text{for all } a_i;\ \text{for all } c_j \text{ with } \sum_j c_j = 0 \qquad (2)$$

The hypothesis of equal mean vectors (1) is true if and only if
statement (2) is true.

For a particular choice of the “a” weights and the “c” weights,
a null hypothesis can be tested through a univariate F test. Every
possible null hypothesis of the form Σ c_j Σ a_i μ_ij = 0 can be
tested by choosing the values of the c_j and a_i that maximize the
value of the F ratio and by then testing this maximized F. If the
largest possible F value is not significant, then no other choice of
the c_j and a_i can produce a significant F. In this case of a
nonsignificant F ratio, there would be no significant differences
between the groups. If the largest F is significant, one infers that
there is at least one dimension (linear composite) that discriminates
among the groups and that consequently the mean vectors are
significantly different. This significant dimension is the first
discriminant function.
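
The maximization described above can be illustrated with a short sketch. It is not the MANOVA program itself (which was written in FORTRAN); it is a minimal Python illustration, assuming NumPy is available, and the function names `sscp_matrices`, `max_f`, and `composite_f` are hypothetical. It relies on the standard result that the largest F over all linear composites is proportional to the largest root of W^-1 B, where B and W are the between-groups and within-groups sums of squares and cross-products matrices.

```python
import numpy as np

def sscp_matrices(groups):
    """Return between-groups (B) and within-groups (W) SSCP matrices."""
    groups = [np.asarray(g, dtype=float) for g in groups]
    grand_mean = np.vstack(groups).mean(axis=0)
    p = grand_mean.size
    B = np.zeros((p, p))
    W = np.zeros((p, p))
    for g in groups:
        mean = g.mean(axis=0)
        dev = (mean - grand_mean)[:, None]
        B += g.shape[0] * (dev @ dev.T)      # weighted by group size
        centered = g - mean
        W += centered.T @ centered
    return B, W

def composite_f(groups, weights):
    """Univariate F for the composite D = sum_i a_i x_i."""
    B, W = sscp_matrices(groups)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    a = np.asarray(weights, dtype=float)
    return (a @ B @ a / (k - 1)) / (a @ W @ a / (n_total - k))

def max_f(groups):
    """Largest possible F and the weights (first discriminant function)."""
    B, W = sscp_matrices(groups)
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(W, B))
    top = int(np.argmax(eigvals.real))
    largest_root = eigvals.real[top]
    weights = eigvecs[:, top].real
    # The maximum F is proportional to the largest root.
    return largest_root * (n_total - k) / (k - 1), weights
```

No other choice of weights can give a larger F than the one returned by `max_f`, which is why a nonsignificant maximized F rules out every composite at once. The critical value for the largest root must still come from the appropriate tables; it is not computed here.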

When the overall null hypothesis is rejected, the Roy approach
provides a Scheffé type post hoc analysis for examining the basis
for rejecting H_0. This analysis furnishes tests of significance
between every pair of groups on every individual dependent variable
and also on every discriminant function. The experimentwise level of
significance is controlled in this analysis, regardless of the
number of tests performed.


The Program 


The computer program MANOVA is based upon the Roy approach.
Program input consists of control cards specifying the number of
variables, number of groups, group sizes, critical value, and
variable format cards. Program output first consists of a test of
the overall null hypothesis based upon the largest eigen root
criterion. (The maximum F statistic is proportional to this root.)
If the overall hypothesis is rejected, the program then provides a
complete post hoc analysis of mean differences in terms of each
individual dependent variable, and each of the discriminant
functions. The standardized discriminant function weights and the
group centroids in discriminant space are also computed.
The program was written in FORTRAN for an IBM 370 computer. A
source copy of the MANOVA program as well as a program manual can
be obtained by writing the author.
COMPUTER PROGRAMS FOR ROBUST ANALYSES IN
MULTIFACTOR ANALYSIS OF VARIANCE DESIGNS


PAUL A. GAMES
The Pennsylvania State University


Robust analyses of multifactor analysis of variance data may be
secured even when assumptions are violated, by use of a set of
five programs. ANOVR produces means, the summary table, group
variances, the variance-covariance matrix on repeated measures,
and tests most assumptions. FOLUP tests any set of pairwise
contrasts of means, while COMCON tests any set of complex
contrasts. Both latter programs permit solutions using
heterogeneous group variances, or using a heterogeneous
variance-covariance matrix from repeated measures, as well as
conventional solutions using mean squares from the summary table.

ANOVR produces the Box-Geisser-Greenhouse index when there
is just one repeated measure factor, while PVCVRL will produce
corresponding indices when there are two or more such factors.
BARTKI produces output that generates the robust and flexible
Bartlett and Kendall test for homogeneity of variances. Intelligent
use of the set of programs frees the user from most assumptions
of AOV.


This paper describes a set of five programs providing robust
techniques that work well when the assumptions of conventional
analysis of variance (AOV) have been violated. The main program,
ANOVR, includes relatively complete tests of assumptions and
provides punched output that constitutes input data for three other
programs. It can handle up to four between-subject factors and up
to four within-subject factors and thus will permit a total of eight
factors in a mixed design. References are made to sections in Winer
(1971) that illustrate the designs and features described.


Designs with Subjects Nested under Factors; Independent Group or
Between-Subject Factorial Designs (Winer, 1971, Chapters 3, 5, 6)

Proportional n’s are required. Homogeneity of cell variances is
tested by Bartlett’s test. Main effect means, simple-main means, or
simple effect cell means may be tested using all +1, -1 contrasts
(all pairs) by means of program FOLUP. If the homogeneous variance
hypothesis is retained in the equal n condition, familywise
Type I error rate may be controlled by employing the Newman-Keuls
test, or the Tukey Wholly Significant Difference (WSD)
(Games, 1971). If unequal n’s are present, a modified WSD may
be used to control familywise error rate, or multiple t’s may be
found to control Type I error per contrast.

The modified WSD consists of comparing a studentized range
value, $q(\alpha, K, df_W)/\sqrt{2}$, to the conventional t statistic,
$(\bar{Y}_j - \bar{Y}_{j'})/(MS_W/n_j + MS_W/n_{j'})^{1/2}$. If
heterogeneous variances are indicated, individual cell variances are
read, the t statistic is replaced by the Behrens-Fisher statistic,
and df_W is replaced by the Welch df solution (Winer, 1971, p. 42).
Howell and Games (1974) showed that these modifications of the WSD
provide for a high degree of control of the familywise Type I error
rate despite unequal variances, and a later study has extended this
finding to the unequal n case. When the homogeneous variance
assumption is true, the use of the Behrens-Fisher solution rather
than the uniformly most powerful t test causes only a small loss of
power for n’s greater than 10.
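
The heterogeneous-variance version of this comparison can be sketched as follows. This is not the FOLUP code; it is a minimal Python illustration of the usual separate-variance (Behrens-Fisher) t with the Welch-Satterthwaite degrees of freedom, and the function names are hypothetical. The studentized range critical value is assumed to be looked up by the user at the returned df and passed in.

```python
from math import sqrt

def behrens_fisher_t(mean_j, var_j, n_j, mean_k, var_k, n_k):
    """Separate-variance t statistic and Welch df for one pairwise contrast."""
    se2_j = var_j / n_j          # squared standard error of the first mean
    se2_k = var_k / n_k          # squared standard error of the second mean
    t = (mean_j - mean_k) / sqrt(se2_j + se2_k)
    df = (se2_j + se2_k) ** 2 / (se2_j ** 2 / (n_j - 1) + se2_k ** 2 / (n_k - 1))
    return t, df

def modified_wsd_significant(t, q_critical):
    """Compare |t| with q / sqrt(2), where q is a tabled studentized range value."""
    return abs(t) >= q_critical / sqrt(2.0)
```

For example, `behrens_fisher_t(12.0, 4.0, 10, 10.0, 4.0, 10)` gives a df of 18, the same as the pooled-variance df, because the two variances and sample sizes are equal; with unequal variances the df drops below that value.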
Program COMCON effects tests of complex contrasts by employing
previously punched means and variances. The use of MS_W for
the homogeneous variance case, or the s² values of individual cells
for the heterogeneous case, is specified by the consumer. The
heterogeneous variance solution is by the robust Welch generalized t
(Welch, 1947). The Welch F′ statistic, a robust alternative to the
conventional F, is available as an option (Brown and Forsythe,
1974; Kohr and Games, 1974).

If the conventional equal n, homogeneous variance situation
holds, the +1, -1 or complex contrasts are so easily done on hand
calculators that use of FOLUP or COMCON is unnecessary. However,
these programs save a great deal of effort and time when applied to
data containing unequal n’s and/or heterogeneous variances.


Designs with Subjects Crossed with Factors; Repeated Measure
Designs (Winer, 1971, Chapter 4)

The ANOVR program differs from other general AOV programs
in that it automatically computes the variance-covariance (VCV)
matrix on all repeated measures. Behavioral scientists should pay
more attention to these matrices and to the correlation matrices
that may be derived from them. Wiley, Schmidt, and Bramble (1973)
have illustrated structural analyses testing models of such VCV
matrices. The VCV are often a source of psychological
interpretations as well as of information indicating that the
conventional assumptions of F = MS_B/MS_BS are grossly violated.

Huynh and Feldt (1970) demonstrated that the mean square
ratios are distributed as F with conventional df only if the
population VCV matrix has properties that produce a
Box-Geisser-Greenhouse index, λ, of 1.0. Box (1954) and Geisser and
Greenhouse (1959) showed that the mean square ratio is approximately
distributed as F(λ df_1, λ df_2). The conservative solution is to
employ a minimum value of λ so that the F(1, n - 1) is used. Collier,
Baker, Mandeville, and Hayes (1967) demonstrated that control of
Type I errors is maintained by estimating λ from the sample VCV
matrix and multiplying the usual df by the estimated value. This
solution has far greater power than does use of the conservative
(1, n - 1) distribution. With a single repeated measure, ANOVR
computes λ and reports the probability of a mean square ratio under
all three solutions.
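
As an illustration of how the Box-Geisser-Greenhouse index can be estimated from a sample VCV matrix, the sketch below uses the standard formula based on the double-centered matrix. It is not taken from ANOVR; it assumes NumPy, a single group of subjects measured on one repeated factor, and hypothetical function names.

```python
import numpy as np

def box_index(vcv):
    """Box-Geisser-Greenhouse index from a k by k variance-covariance matrix."""
    s = np.asarray(vcv, dtype=float)
    k = s.shape[0]
    centering = np.eye(k) - np.ones((k, k)) / k
    s_c = centering @ s @ centering            # double-centered VCV matrix
    return np.trace(s_c) ** 2 / ((k - 1) * np.trace(s_c @ s_c))

def adjusted_df(vcv, n_subjects):
    """Degrees of freedom multiplied by the estimated index."""
    k = np.asarray(vcv).shape[0]
    index = box_index(vcv)
    return index * (k - 1), index * (k - 1) * (n_subjects - 1)
```

A compound-symmetric matrix yields an index of 1.0, so the conventional df are kept; the index can fall as low as 1/(k - 1), which corresponds to the conservative F(1, n - 1) solution.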

With two repeated measure factors at levels J and K respectively,
ANOVR computes the JK by JK VCV matrix, and determines the
probability of a mean square ratio using the conventional df and
conservative df solutions. To obtain the marginal VCV matrices
corresponding to the tests of main effects (e.g., see Winer, 1971,
p. 552) this VCV matrix is submitted to program PVCVRL. PVCVRL
also computes λ_J and λ_K as needed to adjust the df of the tests of
main effects for the Collier, et al. (1967) solution. This process may
be extended for additional repeated measure factors; a three factor
example is given in the PVCVRL write-up. The correlation matrix
from any VCV matrix is computed, if requested.

The test for compound symmetry of a VCV matrix (Winer, 1971,
pp. 596-599) is carried out on the original VCV matrix by ANOVR,
while PVCVRL can conduct this test on any matrix read in or
generated. Similarly the Mauchly test that Huynh and Feldt (1970)
employ to test whether the population λ = 1.0 is conducted by both
ANOVR and PVCVRL. Since the compound symmetry case is a
special case of λ = 1.0, the latter test is more pertinent.

Programs FOLUP or COMCON may be used with MS_BS from the
summary table for doing contrasts when λ ≈ 1.0, or with the punched
VCV matrix for doing contrasts when λ is substantially less than 1.0.
Again solutions using MS_BS are easily done by hand calculators, but
solutions employing the VCV matrix are more conveniently done on
the computer. The latter solutions are completely general, but
slightly conservatively biased if the conventional simple additive
model assumptions are met.

For multifactor designs, the mean squares that may be needed
for tests of means of one factor at a given level of another factor
(Winer, 1971, p. 545) are automatically computed by ANOVR.
These mean squares provide appropriate MS_error estimates when
assumptions are met. Appropriate VCV matrices generated by
ANOVR or PVCVRL may be employed in FOLUP and COMCON
when the assumptions are not met.


Mixed Designs; Subjects Nested under Some Factors but Crossed 
with Other Factors (Winer, 1971, Chapter 7) 


All of the features of the above two cases are included in mixed
designs. In the simplest such design (Winer, 1971, p. 518), with a
levels of the between factor, and b levels of the repeated measures
factor, there would be a different b by b VCV matrices, one for each
independent group. These are computed by ANOVR and may be
printed or punched as requested. From each such matrix, an MS_BS
term is computed that would be the appropriate MS_error for testing
MS_B computed on this group alone. Bartlett’s test is used to test
the homogeneity of these values (Winer, 1971, p. 522).

The a different VCV matrices are pooled and are tested for
homogeneity by the Box technique (Winer, 1971, p. 595). The test for
compound symmetry and the previously mentioned Mauchly test
are conducted on the pooled VCV matrix by ANOVR. If an
experimenter wishes to test the homogeneity of any subset of VCV
matrices, this step may be undertaken by using PVCVRL. Similarly
PVCVRL can provide a pooled matrix for any set of VCV or
correlation matrices, and will test for homogeneity, compound
symmetry, or λ = 1.0.

The value of MS_S is also computed separately for each of the a
independent groups, and Bartlett's test is used to test for
homogeneity (Winer, 1971, p. 521). If a significant AB interaction is
observed, and if simple effect tests are desired, FOLUP or COMCON
can be employed with the individual group variances or with the
individual group VCV’s as appropriate. Thus it is possible to obtain
robust tests despite violations of the conventional assumptions.
Similarly, if the λ computed on the pooled VCV matrix is substantially
less than one, and if MS_B/MS_BS(A) is significant, use of this VCV
matrix in FOLUP or COMCON will provide a robust test on B
main effect contrasts. With three and four factor mixed designs, the
amount of output produced rises rapidly, particularly if the
individual group VCV matrices are requested. However, intelligent use
of these various outputs can furnish relatively powerful tests with
substantial control of Type I errors, even when the assumptions are
violated.

The fifth program is designed to provide robust tests of
homogeneity of variance. Bartlett's test (Winer, 1971, p. 208) and
other classical tests of variance are decidedly permissive when the
populations sampled are leptokurtic rather than normal or platykurtic.
The BARTKI program is designed to take data prepared for an
analysis by ANOVR or BMD08V and to compute log s² values on
subsamples of that data. These values may then be submitted to
ANOVR to complete the Bartlett and Kendall test of homogeneity
of variance (Winer, 1971, pp. 219-220). In addition to being robust
to non-normality, this test can be used to evaluate hypotheses about
variances that correspond to main and interaction effect tests in
multifactor designs. (See Games, Winkler and Probert [1972] and
Gartside [1972] for further exposition of the Bartlett and Kendall
test.) In general the Bartlett and Kendall test has lower power than
the Bartlett test, but it protects against excessive risk of Type I
errors. Thus the Bartlett test of ANOVR may be employed as a
safe negative indicator. If it is not significant, the hypothesis of
homogeneity of variance may be retained. However, if the
hypothesis of homogeneity is rejected by Bartlett’s test, it is
desirable to confirm this conclusion by use of the Bartlett and
Kendall test. The BARTKI program, which is written using the ANOVR
conventions, produces punched output that may be directly “input” to
ANOVR for final processing.
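
The first step of the Bartlett and Kendall procedure, computing log s² values on subsamples, can be sketched briefly. This is not the BARTKI program; it is a minimal Python illustration using only the standard library, with a hypothetical function name. It assumes the scores of one cell have already been placed in random order, so consecutive blocks serve as subsamples, and it drops any leftover scores that do not fill a complete subsample.

```python
from math import log
from statistics import variance

def log_variances(scores, subsample_size):
    """Natural log of the unbiased variance of each consecutive subsample."""
    values = []
    last_start = len(scores) - subsample_size
    for start in range(0, last_start + 1, subsample_size):
        block = scores[start:start + subsample_size]
        values.append(log(variance(block)))   # one log s^2 per subsample
    return values
```

The resulting log s² values would then be treated as the dependent variable in an ordinary analysis of variance with the same design as the original data, which is the role ANOVR plays in the procedure described above.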


Write-ups and a tape copy of these programs may be secured by
sending a small tape to the author. All but the rather small FOLUP
program use dynamic storage allocation to minimize storage
demands. ANOVR and PVCVRL are available in different sizes to
fit different computer capacities. In each case, the smaller size
versions delete various options and features to save space. On a
small computer, the use of several small programs can accomplish
what is possible in one run on a large version of the ANOVR
program. The tape includes copies of the several programs, the data
examples used in the write-ups, and additional data examples for
testing.
A FORTRAN PROGRAM FOR SIMULATING
EDUCATIONAL GROWTH WITH VARYING
SCHOOL IMPACT


JAMES M. RICHARDS, JR., NANCY KARWEIT, AND 
TRUMAN W. PREVATT 


The Johns Hopkins University 


To facilitate empirical investigations of longitudinal
methodology, a computer procedure was developed to generate artificial
data in which true growth scores are known. This procedure
simulates the results of the Educational Testing Service Growth Study
that pertain to the relationships among scores on a test of academic
potential and scores on a test of educational attainment
administered ... determine school impact. The program user is allowed
to specify the correlation between resources and impact and the
extent to which schools vary in impact. Student scores and school
means generated by this program are entered on separate output tapes.


A variety of statistical procedures has been proposed for
overcoming the difficulties in assessing educational growth or
psychological change (Cronbach and Furby, 1970). Because true growth
scores are unknown in most longitudinal research, however, it has
been difficult to compare the relative accuracy of these procedures.
Accordingly, a computer procedure was developed to generate
artificial data in which these true growth scores are known. Such
artificial data should facilitate empirical investigations of
longitudinal methodology.


Underlying Model 


It is important that artificial data resemble real data as closely
as possible to insure that the conclusions of methodological
investigations will apply to the analysis of real data. Therefore, this
program simulates selected aspects of the Project TALENT study
of the American high school (Flanagan, Dailey, Shaycoft, Orr, and
Goldberg, 1962) and of the Educational Testing Service (ETS)
Growth Study (Hilton, Beaton, and Bower, 1971). Project TALENT
provided intercorrelations among a variety of community, school,
and student characteristics for a representative sample of high
schools in the United States. In the ETS Growth Study, students
were assessed initially with a measure of academic potential, the
School and College Ability Tests (SCAT), and a measure of
educational attainment, the Sequential Tests of Educational Progress
(STEP). Subject to the usual attrition in longitudinal research, the
educational attainment of these students was reassessed with STEP
on three subsequent occasions.

Although in the typical computer procedure for producing
correlated scores on two variables, A and B, both scores are generated
at the same time, in this program the A scores usually are given and
a corresponding set of B scores with the specified correlation ρ_AB
between A and B is generated by the computer. Accordingly, B
scores are created by the following equation:


$$B = \rho_{AB} A + Z \sqrt{1 - \rho_{AB}^2}$$


where both A and B have a mean of 0 and standard deviation of 1,
and where Z is a random normal variate.
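
The equation above translates directly into a few lines of code. The sketch below is a Python illustration rather than the original FORTRAN; it assumes NumPy, and the function name `generate_b` and the use of NumPy's default generator in place of the program's own random normal routine are assumptions.

```python
import numpy as np

def generate_b(a_scores, rho, rng):
    """Create B scores correlated rho with given standardized A scores."""
    a = np.asarray(a_scores, dtype=float)
    z = rng.standard_normal(a.shape)           # independent random normal variate
    return rho * a + z * np.sqrt(1.0 - rho ** 2)

# Usage: A scores are given (here simulated), B is generated to correlate .54.
rng = np.random.default_rng(1975)
a = rng.standard_normal(20000)
b = generate_b(a, 0.54, rng)
```

Because Z is independent of A, the B scores have unit variance in expectation and the sample correlation between A and B fluctuates around the specified value.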

An important characteristic of many real data is that students
are assigned to schools nonrandomly whereas in most statistical
tests it is assumed that subjects are assigned to treatments randomly.
This program permits the user to choose between random and
nonrandom assignment. When students are assigned nonrandomly the
program strives to reproduce the average correlation (ρ = .54)
between community per capita income and average academic
potential of students estimated from the Project TALENT study.
Specifically, a (pseudo) random normal variate is generated for each
school and is treated as the per capita income of that school's home
community. Then academic potential scores for the students at that
school are created so that across schools the underlying correlation
between income and average potential is .54 and so that the ratio
of between schools variance to total variance on potential simulates
the Project TALENT ratio.

The program also strives to reproduce the interrelations among
academic potential and educational attainment on four occasions
obtained in the ETS Growth Study. Several true score parameters
were estimated in an earlier investigation (Richards, 1974) and used
in this simulation. Specifically, each student's true score on initial
educational attainment is generated through using the estimated
true score correlation (ρ = .88) between academic potential and
initial attainment. Then a true gain score is produced for that
student on the basis of a combination of several estimated
parameters and is added to yield the true attainment score for that
student on occasion 2. Similarly, gain scores are generated and added
sequentially to yield true attainment scores on occasions 3 and 4.
After the appropriate amount of random error is added to each
score, the scores are transformed to the metric of the ETS Growth
Study observed scores. This simulation procedure closely reproduces
the ETS Growth Study results (Richards, 1974).

Finally, it is assumed in the program that community income
determines school resources and that school resources in turn
determine school impact. Specifically, a measure of resources is created
for each school on the basis of the correlation (ρ = .25) between
community income and resources estimated from the Project
TALENT results. There is little empirical basis, however, for
estimating either the correlation between resources and impact or the
extent to which schools vary in impact. Therefore, the program
allows the user to specify both the correlation between resources and
impact and the standard deviation of the impact variable. This
standard deviation is specified in the form of a number between 0
and 1. When the standard deviation is set at .10, the average true
growth scores are the same as those obtained in the ETS Growth
Study for a simulated school with average impact, and are 10%
higher than the ETS averages for a simulated school one standard
deviation above the mean on impact. (The simulated data appear
to meet the necessary assumptions for this manipulation even if the
ETS data do not.) When the various scores for students at a given
school are computed, the average growth scores are adjusted in
accordance with school impact, and no other changes are made.
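
One way to read the adjustment just described is as a proportional scaling of the average growth score. The sketch below rests on that assumption: the multiplicative form, the function name, and the base value of 20.0 are illustrative and are not taken from the program or from the ETS results.

```python
def adjusted_mean_gain(base_mean_gain, impact_z, impact_sd):
    """Scale an average true gain by school impact.

    base_mean_gain: average true gain for a school of average impact
                    (hypothetical value supplied by the caller).
    impact_z:       the school's impact in standard deviation units.
    impact_sd:      user-specified standard deviation of impact (0 to 1).
    """
    return base_mean_gain * (1.0 + impact_sd * impact_z)

# A school one standard deviation above the mean, with impact SD = .10,
# gains 10% more than the average-impact school.
average_school = adjusted_mean_gain(20.0, 0.0, 0.10)
strong_school = adjusted_mean_gain(20.0, 1.0, 0.10)
```

With an impact standard deviation of 0, every school receives the same average gain, which matches the statement that schools then do not differ in impact.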
The input to this program consists of two control cards. The first
control card provides the following program controls:

1. Columns 1-2. Number of tape drive for school tape.
   (See Output section of this paper.)
2. Columns 4-5. Number of tape drive for student tape.
3. Column 7. A 1 in this column suppresses writing of the school
   tape.
4. Column 9. A 1 in this column suppresses writing of the student
   tape.

The second control card includes the following parameters
specified by the user:

1. Columns 1-3. Number of schools. The permitted range is 001-100.
   ... randomly (i.e., in accordance with Project TALENT results
3. Columns 5-8. Average number of students per school. The
   permitted range, which is 0001 to 9999, should be set so that
   the product of the number of schools and the average number
   of students does not exceed 20,000.
4. Columns 9-10. Standard deviation for number of students per
   school. The permitted range, which is 00 to 99, should be
   small enough relative to the average number of students to
   eliminate much chance of a negative number of students for
   any school (if such a negative number occurs, the program sets
   the number of students for that school at 1). When this
   parameter ...
   ... columns,
6. Columns 17-18. Standard deviation for school impact. This
   parameter takes the form of a number between .00 and .99
   (again the decimal is not punched). When this parameter is
   00, schools do not differ with respect to impact.
7. Columns 19-25. Random normal variate initialization. This
   parameter must be a seven digit odd random number between
   0000001 and 8388607 (in accordance with FORTRAN limitations,
   this maximum is 2^23 - 1). This number provides the
   starting point for the generation of random normal variates. It
   should be different for each independent set of simulated data
   (the identical sequence of normal variates will be generated
   each time the same seven digit number is used as the starting
   point).
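
The fixed-column layout of the second control card can be illustrated with a small parser. This is a hedged Python sketch, not part of the original program: the function name and the returned dictionary keys are hypothetical, and only the fields whose columns are stated above (1-3, 5-8, 9-10, 17-18, 19-25) are read; the remaining columns are ignored.

```python
def parse_second_card(card):
    """Read the documented fields of the second control card (1-based columns)."""
    card = card.ljust(25)
    fields = {
        "n_schools": int(card[0:3]),          # columns 1-3
        "mean_students": int(card[4:8]),      # columns 5-8
        "sd_students": int(card[8:10]),       # columns 9-10
        "impact_sd": int(card[16:18]) / 100,  # columns 17-18, implied decimal
        "seed": int(card[18:25]),             # columns 19-25
    }
    if not 1 <= fields["n_schools"] <= 100:
        raise ValueError("number of schools must be 001-100")
    if fields["n_schools"] * fields["mean_students"] > 20000:
        raise ValueError("schools times average students exceeds 20,000")
    seed = fields["seed"]
    if seed % 2 == 0 or not 1 <= seed <= 2 ** 23 - 1:
        raise ValueError("seed must be odd and between 1 and 8388607")
    return fields

# Usage: 50 schools, 200 students on average (SD 10), impact SD .10.
example = parse_second_card("050" + "1" + "0200" + "10" + "000000" + "10" + "0012345")
```

The validation mirrors the stated limits: the seed must be odd and no larger than 2^23 - 1 = 8388607, and the product of schools and average enrollment must not exceed 20,000.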


Output 

This program provides the following three outputs: 

1. Student tape. For each student, this tape includes a school
ID number, a student ID number within that school, the three 
true gain scores, and both true and observed scores for academic 
potential and for educational attainment on four occasions. 

2. School tape. For each school, this tape includes a school ID 
number, community per capita income, school resources, school 
impact, average academic potential as computed from per 
capita income, and the average true and observed scores for 
students at that school on academic potential and on educa- 
tional attainment on four occasions. 

3. Correlation matrices. In addition, a printed output provides
observed score means, standard deviations, and intercorrela- 
tions for the academic potential variable and the various mea- 
sures of educational attainment. The first of two matrices 
summarizes the relationships among scores for students at all 
schools combined and the second summarizes the relationships 
among school means. The second matrix is not computed when 
the number of schools is less than 25. 

Limitations 

This program, which is written in FORTRAN IV, requires the
equivalent of 6500 core locations on an IBM 7094 computer. Two
tape drives for output are also required.


Availability 
For a copy of the source deck, write to Center for Social Organiza- 
tion of Schools, The Johns Hopkins University, Baltimore, Mary- 
land 21218. Please enclose five dollars ($5.00) to cover costs of 
reproduction, handling, and mailing. Purchase orders should be 
payable to The Johns Hopkins University. 
A COMPUTER PROGRAM TO TEST A REPEATED
MEASURES HYPOTHESIS USING HOTELLING'S
ONE-SAMPLE T² STATISTIC


PETER P. VITALIANO AND SILAS HALPERIN
Syracuse University


The extent and direction of bias in an approximate test (T_A²)
of a one-way repeated measures hypothesis was studied. Monte
Carlo methods were used to simulate nine multinormal parent
populations. Five thousand samples were drawn from each
population and an F statistic (transformed from T_A²) was calculated
for each sample. Nine sampling distributions, each containing
5,000 Fs, were then observed.

T_A² was shown to be very conservative: the proportions observed
in the upper tails of the nine distributions were much smaller (in
most cases one-half the size) than the proportions ... of a program
which computes ... recommended, over available package programs,
... minimum input.


THE usual way to test a repeated measures hypothesis:

$$H_0: \mu_1 = \mu_2 = \cdots = \mu_p \qquad (1)$$

is to use a univariate F test. A sufficient assumption for this test to
be exact is that the population covariance matrix Σ possess
compound-symmetry (equal covariances and equal variances). An
alternative way to test (1) is to use Hotelling’s one-sample T² test
which does not require the compound-symmetry assumption.
The T² statistic has the general form:

$$T^2 = n(\bar{x} - \mu_0)' S_x^{-1} (\bar{x} - \mu_0) \qquad (2)$$

where n is the sample size, x̄ and μ_0 are respective p-dimensional
of μ are equal, and not necessarily equal to a set of specified
values (i.e., (1)), the T² in (2) is inappropriate. However, the vector
variable in (2), namely x, may be transformed to a new vector
variable, y = Cx, where C is a full rank (p - 1 × p) matrix that is
chosen so that the hypothesis H_0: μ_y = 0 implies the hypothesis in
(1). Anderson (1958) and Morrison (1967) have shown that if the
rows of C are linearly independent and if they are also the
coefficients of contrasts, then T² based on y will test the hypothesis
(1). This statistic is defined as:

$$T_y^2 = n(\bar{y} - \mu_y)' S_y^{-1} (\bar{y} - \mu_y) \qquad (3)$$

where ȳ and μ_y are respective (p - 1)-dimensional vectors of
transformed sample treatment and hypothesized population means and
S_y = C S_x C' is the sample transformed (p - 1 × p - 1) covariance
matrix. Given (3), it can be shown that

$$\frac{n - p + 1}{(n - 1)(p - 1)} T_y^2 = F(p - 1, n - p + 1).$$

Two standard textbooks (Winer, 1962; Kirk, 1968) present an
inexact alternative to the univariate F. Although Winer states (in a
footnote) that this statistic is only approximate, Kirk presents it
as an exact statistic. This approach tests (1) by the approximate
statistic:

$$\frac{n - p}{(n - 1)p} T_A^2 = F(p, n - p), \qquad (4)$$

where

$$T_A^2 = n B' S_x^{-1} B. \qquad (5)$$

One should observe two differences in (3) and (5). First, ȳ' is a
(p - 1)-dimensional row vector:

$$\bar{y}' = (\bar{x}_1 - \bar{x}_2, \bar{x}_2 - \bar{x}_3, \ldots, \bar{x}_{p-1} - \bar{x}_p),$$

whereas B' is a p-dimensional row vector:

$$B' = (\bar{x}_1 - G, \bar{x}_2 - G, \ldots, \bar{x}_p - G).$$

Second, because the grand mean, G, is a linear combination of the
means, x̄_j's, the C matrix which produces the B' in (5) is not full
rank. Thus, if this C were used to transform S_x, the transformed
covariance matrix would be singular. It is for this reason that (5)
contains the untransformed covariance matrix, S_x.
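
The contrast between the exact and the approximate statistic can be made concrete with a short sketch. This is not the authors' program; it is a Python illustration assuming NumPy, with hypothetical function names, successive-difference contrasts as the default choice of C, and a hypothesized mean vector of zero for the transformed variables.

```python
import numpy as np

def successive_difference_contrasts(p):
    """Full rank (p - 1) by p matrix whose rows are contrasts."""
    c = np.zeros((p - 1, p))
    for i in range(p - 1):
        c[i, i], c[i, i + 1] = 1.0, -1.0
    return c

def exact_t2(x, c=None):
    """T-squared on y = Cx, its F transform, and the F degrees of freedom."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    if c is None:
        c = successive_difference_contrasts(p)
    y_bar = c @ x.mean(axis=0)
    s_y = c @ np.cov(x, rowvar=False) @ c.T     # transformed covariance matrix
    t2 = n * y_bar @ np.linalg.solve(s_y, y_bar)
    f = (n - p + 1) / ((n - 1) * (p - 1)) * t2
    return t2, f, (p - 1, n - p + 1)

def approximate_t2(x):
    """Approximate statistic built from deviations about the grand mean."""
    x = np.asarray(x, dtype=float)
    n, p = x.shape
    x_bar = x.mean(axis=0)
    b = x_bar - x_bar.mean()                    # deviations from the grand mean G
    t2 = n * b @ np.linalg.solve(np.cov(x, rowvar=False), b)
    f = (n - p) / ((n - 1) * p) * t2
    return t2, f, (p, n - p)
```

The exact statistic does not depend on which full rank set of contrasts is used for C, and with p = 2 it reduces to the square of the familiar paired t statistic. The approximate version uses the untransformed covariance matrix and different degrees of freedom, which is the source of the bias studied in this paper.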


In order to study the distributional properties of T_A², the authors
empirically generated distributions of (4). Monte Carlo methods
were used to create three multinormal populations for different
numbers of treatments, namely p = 2, 3, and 5. From these
populations, samples of different sizes (n) were chosen so that the
denominator degrees of freedom, n - p, would be equal to 10, 30, and
120 regardless of the size of p. Thus nine sampling schemes were
formulated. For each (p, n - p) scheme, 5,000 samples were drawn,
T_A² was calculated for each sample, and a sampling distribution of
5,000 observed Fs was formed.

Table 1 presents the discrepancies between observed and nominal
α levels for the nine generated distributions. Because T_A² is shown
to be so conservative, its use by an unaware researcher would
provide a test with less power than might be anticipated.

Given these results, the following recommendations are offered
for researchers who wish to use T² to test (1). First, general
programs available should be used with care: such programs as the
Multivariate General Linear Hypothesis BMDX63 and Multivariate
Analysis of Variance and Covariance BMDX69 (Dixon, 1968) require a
reparameterization of design variables; the tabular results do not
support the reparameterization scheme (in (5)) implied by Kirk (1968)
and Winer (1962). It should be mentioned that Winer (1971), in his
second edition, has provided a correct transformation. Second, for
less sophisticated users, such as those not familiar with programs
that require transformations, a T² program is offered which requires
minimum input for use.

Table 1. Discrepancies between Observed Proportions and Expected
Proportions (Upper Critical Regions of .05 and .01) for the Nine
Sampling Distributions


Input 


The data deck for each experiment consists of 2 + n cards, where
the first card contains the number of treatments and the number of
subjects and the second card contains the variable format of the
scores to be read in for each subject. Each of the next n cards
(corresponding to n subjects) contains a score vector for each
subject, where the format of each card's scores conforms to that
specified on the second card. The program handles a maximum of 20
treatments.


Output 

The computer output includes: the number of treatments and
subjects, a table of treatment means, standard deviations, and the
corresponding covariance and correlation matrices, a vector of
transformed means, the Hotelling T² statistic, and the F ratio with
its appropriate degrees of freedom.


Availability 


A listing of the program may be obtained from Peter P. Vitaliano,
Department of Psychology, Syracuse University, Syracuse, New
York 13210.
LPA2: A FORTRAN V COMPUTER PROGRAM FOR
GREEN'S SOLUTION OF LATENT CLASS ANALYSIS
APPLIED TO LATENT PROFILE ANALYSIS


BERTIL MÅRDBERG
University of Bergen, Norway 


A computer program for solving latent profile analysis (Gibson,
1959) is presented. Green's (1951) solution for latent class analysis
was used. N observations on p variables constitute the data
matrix. The program estimates latent mean vectors and classifies
the observation vectors to them. The result is a number of clusters
of the observation vectors. A χ²-test of local independence is
defined. Latent and observed discriminabilities are defined as the
proportion of a variable's variance which is explained by
between-variance. Discriminability constitutes the lower bound for
estimated variable reliability. The program performs the analysis for
a specified series of numbers of latent profiles and specified values
on the diagonal of the product moment correlation matrix.


Latent profile analysis, LPA, for continuous variables was
developed by Gibson (1959) as a generalization of latent class
analysis, LCA (see Lazarsfeld, 1959). He suggested Green’s (1951)
solution of latent class analysis as an estimation method for latent
profile analysis. LPA2 is a program for the solution of Green’s
estimation method applied to latent profile analysis. The program
constitutes a system for the arrangement of observation vectors into
homogeneous clusters from continuous variables.

Latent profile analysis is a submodel within the general system 
latent structure analysis, LSA (Lazarsfeld and Henry, 1968), which 
serves as a model for relations between manifest (observable)
variables and latent (nonobservable) variables. Latent profile analysis
can be derived from two assumptions: the first being that of
local independence and the second that of a discretely distributed
latent variable.

The general model for LPA is:

$$r_{ijkl\ldots} = \sum_{s=1}^{m} q_s z_{is} z_{js} z_{ks} z_{ls} \cdots \qquad (1)$$

where $r_{ijkl\ldots}$ are the product moments, $i, j, k, l, \ldots$ being indices for
$p$ variables; $q_s$ is the proportion of subjects in latent profile $s$. There
are $m$ profiles. The $z_{is}$ entries are the latent profile elements for
variable $i$ and profile $s$ in standard scores. The right-hand side represents
parameters in the model.

Green’s solution uses information up to and including third order
product moments, i.e., equations of the type:

$$r_{ijk} = \sum_{s=1}^{m} q_s z_{is} z_{js} z_{ks} \qquad (2)$$

By using manifest information (the left-hand side) in (2), one
can transform factors of product moments up to and including the
second order into estimates of latent profiles ($z_{is}$) and proportions
($q_s$). The number of latent profiles can be less than or equal to
$p + 1$. A solution of the model presupposes a given value for $m$, the
number of latent profiles.
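
As an illustration of the model equations, the following minimal sketch computes the product moments implied by a set of latent parameters under local independence. It is not part of LPA2; the function name `implied_moments` and the example values of `q` and `z` are hypothetical, and only moments among distinct variables are computed.

```python
from itertools import combinations


def implied_moments(q, z, order):
    """Model-implied product moments of a given order among distinct variables.

    q[s]    : proportion of subjects in latent profile s
    z[i][s] : latent profile element for variable i and profile s (standard scores)
    """
    moments = {}
    for idx in combinations(range(len(z)), order):
        total = 0.0
        for s, q_s in enumerate(q):
            term = q_s
            for i in idx:
                term *= z[i][s]  # product of the profile elements of the chosen variables
            total += term
        moments[idx] = total
    return moments


# Hypothetical example: two latent profiles, three variables.
q = [0.5, 0.5]
z = [[1.0, -1.0], [0.5, -0.5], [0.8, -0.8]]
second_order = implied_moments(q, z, 2)  # keys are variable pairs (i, j)
third_order = implied_moments(q, z, 3)   # keys are variable triples (i, j, k)
```

The estimation problem addressed by Green's solution runs in the opposite direction: the observed moments are given and the values of q and z are sought.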


Description 


The program reads in an N × p (N observations on p variables)
data matrix. The variables are standardized to have means of zero
and standard deviations of unity. Product moments up to and
including the third order are computed. Through use of a given
value for m, the matrix equations are solved, an outcome which
gives estimates of $z_{is}$ and $q_s$. This step is done for different values
on the diagonal of the linear product moment matrix ($r_{ii}$). The
program also permits an iterative process for stabilizing the estimates.
Each observation vector is classified to the nearest latent
profile, and the set of $N_s$ observations which are classified to latent
profile s constitutes a cluster. The mean vector for this cluster is
an observed profile. The latent profiles which are not allocated any
observation vectors are known as nonsense profiles. The observed
profiles constitute the results of the clustering. The estimating and
classifying procedures are undertaken for a series of values of m.

Each analysis of a series of the number of latent profiles, m, and
of the values on the diagonal, $r_{ii}$, is evaluated for its fit to the model
and for the predicted number of wrongly classified subjects. A
χ²-test of local independence is defined after Anderson (1958, pp.
264-267). Discriminability is defined as
$$d^2(i) = \sum_{s=1}^{m} q_s z_{is}^2 \quad \text{(see Meredith, 1965)}, \qquad (3)$$

where $d^2(i)$, the proportion of a variable's variance which is explained
by between-variance, constitutes the lower bound for estimated
variable reliability.

Preliminary simulation studies indicate that in order to keep the
percentage of wrongly classified observation vectors below about
15, the average observed variable discriminability should exceed .70.
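
The classification step and the latent discriminability can be sketched as follows. This is an illustrative reading, not the LPA2 code: the text says only that each vector goes to the "nearest" latent profile, so the squared Euclidean distance used here is an assumption, and the function names and data are hypothetical.

```python
def classify(observations, z):
    """Assign each standardized observation vector to its nearest latent profile.

    z[i][s] is the latent profile element for variable i and profile s.
    Nearness is measured by squared Euclidean distance (an assumption).
    """
    m = len(z[0])
    clusters = {s: [] for s in range(m)}
    for n, x in enumerate(observations):
        d2 = [sum((x[i] - z[i][s]) ** 2 for i in range(len(x))) for s in range(m)]
        clusters[d2.index(min(d2))].append(n)
    return clusters


def latent_discriminability(q, z):
    """d^2(i) = sum over profiles of q_s * z_is^2, one value per variable."""
    return [sum(q_s * z_i[s] ** 2 for s, q_s in enumerate(q)) for z_i in z]


# Hypothetical example: two profiles, two variables, three observations.
q = [0.5, 0.5]
z = [[0.8, -0.8], [0.5, -0.5]]
observations = [[1.0, 0.4], [-0.9, -0.6], [0.7, 0.7]]
clusters = classify(observations, z)
d2 = latent_discriminability(q, z)
```

A profile whose list in `clusters` stays empty corresponds to what the text calls a nonsense profile.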


Input

Parameter card, format card, and observation vectors are needed.
The number of variables, the range of expected number of latent
profiles, and the range of diagonal values, $r_{ii}$, are specified for the
standard case.


Output

The output consists of (1) overall means and standard deviations
and, as options, linear and triple correlation matrices and eigenvectors
and eigenvalues of the linear correlation matrix; (2) for
every given value of m and $r_{ii}$:
    Check of the numerical solution.
    Test of local independence.
    Distances between latent and observed profiles.
    Estimated and observed profiles.
    Observed profiles and within standard deviations in raw scores.
    Latent and observed discriminabilities.
    Mean discriminabilities.
    Lists of the assigned subjects to the latent profiles (the
    clusters).
    Optional: Within linear correlation matrices and eigenvectors
    and eigenvalues of these matrices;
(3) summary of the solution process. For every value of m and $r_{ii}$
the number of observed profiles and the mean discriminabilities
are given.

Limitations 

The numbers of variables p and of profiles m are restricted as 
follows: p < 30; m < 25. The program takes approximately 40K 
words of core. 

Computer and Program Language 

The program is written in FORTRAN V for a UNIVAC 1110 in
double precision.

Availability

A copy of the UNIVAC-version (for which the reader is
to provide a tape), a copy of the source listing, and a write-up
with test data may be obtained from Bertil Mardberg, Department
of Psychometrics, Institute of Psychology, University of Bergen,
P.O. Box 25, N-5014 Bergen U., Norway.

A POPULATION SUBGROUP MULTIPLE COMPARISON
COMPUTER PROGRAM BASED UPON 
CATEGORICAL DATA 


BERNARD A. RAFACZ 
Navy Personnel Research and Development Center¹
San Diego, California 


Much of the information obtained on a particular population
within psychological and educational research involves categorical 
data, usually in the form of nominal responses to items on a 
questionnaire. It is often of interest to compare various sets of 
subgroups of the population with respect to their response dis- 
tributions to the items on the questionnaire. Procedures for per- 
forming multiple subgroup comparisons have been presented by 
Snee (1974) and Gabriel (1966). The question considered is 
whether or not a set of subgroups is homogeneous over the item 
response categories. Furthermore, if a set of subgroups is declared 
heterogeneous for an item, which subsets of that set may be
declared heterogeneous? It could be necessary to perform subgroup 


naire item. 


The purpose of the FORTRAN IV computer program discussed
herein is to perform multiple comparisons of the subgroups of a
population with respect to their response distributions on questionnaire
items. The program was designed to be flexible enough to be
able to perform such comparisons for any number, sequence, and
combination of subgroups of a population over any reasonable number,
sequence, and combination of categories for an item. Finally,
the output is designed to be readily understood by persons with
little or no background in statistics.

¹ The opinions expressed are those of the author and do not necessarily
reflect those of the Navy Department.


Approach


Let S be a set of s subgroups of a population and C a set of c
categories of a questionnaire item. Further, let $n_{ij}$ be the observed
sample of individuals from the ith population ($i = 1, \ldots, s$)
responding to the jth category ($j = 1, \ldots, c$). The hypothesis
considered is whether or not the subgroups S are homogeneous over
the categories C. The subgroups S are asserted to be heterogeneous
over the categories C if the following condition is met:

$$2I(S, C) > \delta,$$

where $\delta$ is the upper $\alpha$ percentage point of the chi-square distribution
with $(s - 1)(c - 1)$ degrees of freedom. $I(S, C)$ is the likelihood
ratio statistic [see Mood (1959)] given by:

$$I(S, C) = \sum_{i}\sum_{j} n_{ij} \ln n_{ij} - \sum_{i} n_{i\cdot} \ln n_{i\cdot} - \sum_{j} n_{\cdot j} \ln n_{\cdot j} + n \ln n,$$

where

$$n_{i\cdot} = \sum_{j} n_{ij}, \quad n_{\cdot j} = \sum_{i} n_{ij}, \quad \text{and} \quad n = \sum_{i}\sum_{j} n_{ij}.$$

The program user selects a collection of sets of subgroups of the
population and the number of categories for each item to be analyzed.
Utilizing the aforementioned test statistic, each set of subgroups
is, or is not, asserted to be heterogeneous over the categories
under consideration. If the subgroups in a comparison set cannot
be asserted to be heterogeneous, the analysis is complete for this
set for those categories. That is, as Gabriel (1966) pointed out,
if a combination of subgroups cannot be asserted to be heterogeneous,
then no combination of subgroups formed from that set
could be asserted to be heterogeneous. However, if a set of subgroups
is asserted to be heterogeneous, the program searches among
all possible pairwise combinations of the set of subgroups for
additional heterogeneous subgroups over the same categories. This
analysis is performed upon each set of subgroups the user provides
to the program.
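
The test statistic and the pairwise search described above can be sketched as follows. This is a minimal illustration, not the FORTRAN IV program: the function names are hypothetical, the critical value `delta` is supplied by the caller rather than looked up, and applying the same `delta` to the pairwise comparisons is an assumption made in the spirit of Gabriel's (1966) argument. Sets containing a zero cell are skipped, as the program description indicates.

```python
import math
from itertools import combinations


def two_i(table):
    """Return 2I(S, C) for a subgroups-by-categories frequency table.

    Returns None when any cell is zero, in which case no test is made.
    """
    if any(cell == 0 for row in table for cell in row):
        return None

    def xlogx(v):
        return v * math.log(v)

    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    info = (sum(xlogx(cell) for row in table for cell in row)
            - sum(xlogx(t) for t in row_totals)
            - sum(xlogx(t) for t in col_totals)
            + xlogx(n))
    return 2.0 * info


def heterogeneous_pairs(table, delta):
    """Pairs of subgroups asserted heterogeneous, searched only when warranted."""
    overall = two_i(table)
    if overall is not None and overall <= delta:
        return []  # the set is not heterogeneous, so no subset can be
    pairs = []
    for a, b in combinations(range(len(table)), 2):
        stat = two_i([table[a], table[b]])
        if stat is not None and stat > delta:
            pairs.append((a, b))
    return pairs


# Hypothetical data: three subgroups, two response categories.
# 5.991 is the upper .05 point of chi-square with (3 - 1)(2 - 1) = 2 df.
table = [[10, 20], [20, 10], [10, 20]]
pairs = heterogeneous_pairs(table, 5.991)
```

In this example the second subgroup differs from each of the other two, while the first and third are identical.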

Subgroup comparisons based upon the combining or collapsing
of subgroups and/or categories can be made within the same run,
or subsequent runs. Comparisons on any combination of subgroups
or categories are possible, limited only by the program array sizes.
Each item may be re-evaluated any number of times within the
same run for any desired combination of categories.


Input 


Input to the program consists of a sequence of information which 
describes in some way the groups that are to be compared and the 
items to be analyzed. The information that the program requires 
includes: (1) a general description of the population being con- 
sidered; (2) the alpha level (Type 1 error) at which the groups 
are to be compared; (3) the number of characters (ICONT) to be 
read from a data record; (4) the total number of subgroups (NG) 
for which questionnaire item tabulations will be found; (5) the
number of questionnaire items (NQ); (6) the unit (NU) on which
the data for the subgroups is to be found; (7) a variable format 
statement for the data; (8) a set of NQ cards (each card in the set 
describes for a questionnaire item the location of the item in the 
data record, the characters on which tabulations are to be made 
for the NG subgroups, the number of categories of the item over 
which comparisons are to be performed, and any re-editing of the 
original data characters for this item analysis); (9) a set of cards
for each of the NQ questionnaire items (each set describes the 
questionnaire item and its response alternatives); (10) a set of 
NG cards that describe the groups being compared; and (11) the 
number of sets of subgroups (NCOMP) being compared and the 
subgroups that are to appear in each comparison set. 


Output 

The output for each questionnaire item analysis includes: (1) a
description of the item and its response alternatives; (2) the names
of the subgroups being analyzed; (3) the frequency of response of
the subgroups over the item categories to include itemization of
illegal responses; (4) a unique symbol associated with each set
of subgroups being compared (if heterogeneity is not asserted
among the subgroups, only that symbol occurs; otherwise the
symbol and a plus sign “+” occur); and (5) a summary of all pairwise
comparisons among subgroups for which heterogeneity was
asserted.

For the case of zero frequency response of a subgroup to some
questionnaire item category, the test of homogeneity is not performed
for those comparison sets in which the subgroup appears.
However, all pairwise combinations of subgroups in a comparison
set for which that subgroup does not appear are tested for possible
heterogeneity.

Limitations

The program is presently limited to include: (1) alpha levels
.01, .05, or .10; (2) maximum of 30 subgroups; (3) maximum of
50 questionnaire items with a maximum of 12 response alternatives
to an item; and (4) a maximum of 10 sets of subgroups (each set
of subgroups is to be compared simultaneously over some of the
response alternatives). All of the variable size limitations can
be altered by increasing the array sizes within the program.

There is no provision built into the program for a user to combine
arbitrarily subgroups within one program execution. However,
because initial runs are usually necessary to decide which subgroups
are to be combined, the appropriate subgroup collapsing can be
considered on subsequent runs.


Availability

A copy of this paper, a listing of the FORTRAN IV source program,
a documentation package for users, and a sample problem
may be obtained by writing to Bernard A. Rafacz, Navy Personnel
Research and Development Center, Code 310 BR, San Diego,
California 92152.

WARD AND HOOK REVISITED: A TWO-PART
PROCEDURE FOR OVERCOMING A DEFICIENCY IN 
THE GROUPING OF PERSONS 


HUBERT S. FEILD
Auburn University 
LYLE F. SCHOENFELDT 
University of Georgia 


The Ward and Hook (1963) hierarchical grouping program is
a frequently used method to cluster persons into groups. Because
of a deficiency in the procedure, the groupings are somewhat less
than optimal. In order to meet this deficiency, a two-part procedure
was developed to be used in conjunction with the Ward
and Hook program for a more optimal grouping of subjects. The
first part of the procedure checks the assignments of the subjects
to the groups and removes inappropriately classified subjects. The
second part confirms the reassignments of the subjects to their
groups. Specifics regarding the application of the two-part procedure
are discussed.


The grouping of subjects into subgroups is a common practice
used by many investigators. Numerous techniques have been employed
to place people into subgroups which are homogeneous with
respect to a number of grouping dimensions. One of the most popular
and frequently used is the hierarchical grouping procedure
developed by Ward and Hook (1963).


The Ward and Hook Procedure 

In many situations, an investigator has collected a series of
measures (tests or other observations) on a sample of N subjects
and desires to know which subjects have similar profiles on the
variables. The hierarchical grouping procedure is an iterative one,
the objective of which is to cluster systematically the N profiles on
the basis of their similarity. This procedure makes no assumptions
as to the number of groups in the sample but instead begins by
treating each of the N profiles as its own group and continues
until all are in one group. At
each stage of the grouping, all possible pairs of the groups’ profiles
are considered and the two most similar ones combined. Through
each of the stages, the total number of groups is reduced by one
while minimizing the increment in total within-group error. An
inflection in the incremental error from a given stage to the succeeding
stage indicates that the groups combined were dissimilar, and
the pairings at the stage preceding the inflection become the solution.
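
The grouping procedure just described can be sketched as follows. This is a simplified illustration, not the Ward and Hook program: it takes the increment in within-group error for merging two groups to be n_a n_b / (n_a + n_b) times the squared distance between their mean profiles, which is the usual form of Ward's criterion and is assumed here. The function names and data are hypothetical.

```python
def centroid(profiles, members):
    """Mean profile of the listed members."""
    k = len(profiles[0])
    return [sum(profiles[i][d] for i in members) / len(members) for d in range(k)]


def merge_cost(profiles, a, b):
    """Increment in within-group error sum of squares if groups a and b are merged."""
    ca, cb = centroid(profiles, a), centroid(profiles, b)
    dist2 = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist2


def hierarchical_grouping(profiles):
    """Merge the two most similar groups at each stage until one group remains.

    Returns a list of (number of groups, error increment, groups) per stage.
    """
    groups = [[i] for i in range(len(profiles))]
    stages = []
    while len(groups) > 1:
        cost, a, b = min(
            (merge_cost(profiles, groups[a], groups[b]), a, b)
            for a in range(len(groups))
            for b in range(a + 1, len(groups))
        )
        groups[a] = groups[a] + groups[b]
        del groups[b]
        stages.append((len(groups), cost, [list(g) for g in groups]))
    return stages


# Hypothetical data: four subjects measured on two variables.
profiles = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
stages = hierarchical_grouping(profiles)
```

In this example the error increment jumps sharply at the last stage, so the two-group stage preceding that inflection would be taken as the solution.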


A Serious Deficiency

One deficiency in the grouping procedure is that once assigned
to a group, the individual remains in that group. As additional
subjects are assigned to the groups, the profile of the group will
shift leaving the original group member(s) on the periphery of
group membership. Thus, the assignment of individuals to subsets
is usually less than optimal at the conclusion of the grouping (Ward,
1963). In order to correct this deficiency, a two-part procedure
has been developed to evaluate the fit of each subject to his assigned
subgroup.

The first part consists of an “affirmation” program¹ which compares
the profile of each subject with the profile of every subgroup
and either affirms membership in the assigned group or removes
him from it. Removal could be for the reason that (a) the subject
was a “misfit” and should be reclassified to another group; (b) the
subject was an “isolate” and should not be classified to any of the
groups; or (c) the subject was an “overlap” who fit more than one
group. Adjusted group means are computed following each change
in group membership, and the process is repeated until the number
of changes is minimal and the subjects’ group membership is
affirmed.
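
One pass of such an affirmation check might look like the sketch below. The actual decision rules are distributed with the program and are not given in this paper, so the single threshold `fit_limit` on squared distance, the status labels as computed here, and the function name `affirm` are all hypothetical stand-ins.

```python
def affirm(profiles, assigned, centroids, fit_limit):
    """Label each subject as affirmed, misfit, isolate, or overlap.

    A subject is taken to "fit" a group when the squared distance between
    the subject's profile and the group mean profile is at most fit_limit
    (a hypothetical rule used only for illustration).
    """
    results = []
    for x, group in zip(profiles, assigned):
        d2 = [sum((a - b) ** 2 for a, b in zip(x, c)) for c in centroids]
        fits = [k for k, value in enumerate(d2) if value <= fit_limit]
        if not fits:
            status = "isolate"
        elif len(fits) > 1:
            status = "overlap"
        elif fits[0] != group:
            status = "misfit"
        else:
            status = "affirmed"
        results.append((status, fits))
    return results


# Hypothetical data: two group mean profiles and four subjects.
centroids = [[0.0, 0.0], [4.0, 0.0]]
profiles = [[0.5, 0.0], [3.5, 0.0], [2.0, 0.0], [2.0, 10.0]]
assigned = [0, 0, 1, 1]
labels = affirm(profiles, assigned, centroids, fit_limit=5.0)
```

In a full run the group means would be recomputed after each change in membership and the pass repeated, as the text describes.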
The second step provides confirmation by capitalizing on the
fact that one has "known" groups. Discriminant functions are
formed from the variables, and the groups are located in the
discriminant space. The subjects are treated as “new” individuals and
are classified to the groups (Cooley and Lohnes, 1971). Overlaps,
isolates, and misfits are identified as described previously, and
discrepancies in results, by comparing the discriminant function
results with those obtained previously in the variable space, are
noted. The net result is the classification of the subjects to the
groups with 100% hits in either the variable or discriminant spaces.

¹ The program was originally written by Mike Brodie.

The application of the two-part procedure described above results
in a "cleaner," more optimal assignment of subjects to subgroups.
As such, the subgroups formed on the basis of these procedures
are more homogeneous in terms of the grouping variables than are
groups formed without using the two-part procedure.

Output 

The output for the affirmation program consists of D² (a measure
of profile similarity) values indicating how well individuals fit
the groups to which they have been assigned. Through use of a
series of decision rules, evaluations are made for each subject
concerning his group membership, i.e., misfit, overlap, or isolate. The
output for the classification program is a discriminant analysis 
which classifies the subjects to groups. Decisions regarding the 
classification are made according to the suggestions of Cooley and 
Lohnes (1971) and Rulon, Tiedeman, Tatsuoka, and Langmuir 
(1967). The objective of this step is to obtain a 100% correct 
classification of subjects to their groups. 


Program Description and Limitations 


Both programs, which are written in FORTRAN IV, have been
used on IBM 360/50, 360/65, and 370/158 computers. The system can
handle 1,000 subjects with a maximum of 30 variables on each.
The limit on the number of groups is 25.


Program Availability 


The affirmation and classification program listings along with
the decision rules to be used in evaluating the program results can
be obtained by writing Dr. Lyle F. Schoenfeldt, Measurement and
Human Differences Program, Department of Psychology, University
of Georgia, Athens, Georgia, 30602.

ECHO: A COMPUTER BASED TEST FOR THE
MEASUREMENT OF INDIVIDUALISTIC, COOPERATIVE, 
DEFENSIVE, AND AGGRESSIVE MODES OF BEHAVIOR 


DAVID J. KRUS 
University of Southern California 


KAREL R. BALCAR 
Charles University 
PATRICIA C. BLAND 
University of Minnesota 


four regions. This decision-making behavior was interpreted in 
relation to its underlying personality traits through using the 
imago algorithm. Standardization of test scores and their inter- 
pretations, based on variable test norms, were automated. 


From its beginning, the field of objective personality testing has
been dominated by paper and pencil tests. This type of testing 
strategy measures, in the majority of cases, only reports about 
behavior, either actual or imagined. Rapid developments in the 
area of computer technology, paralleled by the increasing avail- 
ability of time-sharing computing, make feasible several new 
strategies, which attempt to measure a group of personality traits 
by analyzing the actual behavior of the participant in a computer- 
simulated game. 


The Allocation Game 


ECHO evolved from an extension and computerization of Horn- 
stein and Deutsch’s (1967) allocation game. This original three- 
parameter experimental game required subjects to channel their 
efforts into several kinds of products; in this particular instance
into coloring paper squares. Individualistic and cooperative modes
of behavior were represented by black and blue squares, respectively,
whereas red squares represented weapons. The computerized
version of ECHO required subjects to make decisions about the
allocation of monetary units into four interest areas. The instructions,
displayed on a cathode-ray tube time-sharing computer terminal,
read as follows:

This game is a model of an international situation. You can
make decisions about the allocation of nine units (each unit repre- 
senting 10 billion dollars) into the following areas: 

Independent Enterprise: The expected payoff is double the 
original investment. 

Cooperative Enterprise: The expected payoff amounts to three 
times the joint investment. 

Defense: Zero payoff, but every unit invested by you protects 
your state against twice as many units directed against you by 
your opponent. 

Offense: The payoff which you gain from your opponent is five 
times greater than the original investment made by you. 

The political situation is uncertain and your mutual relation- 
ships with the other state (ruled by a computer named Cyber) are 
unclear. You can expect cooperation as well as aggression. Please 
make your decisions by entering nine whole units, separated by 
commas. Good Luck.

An example of a typical game is shown as follows:

                  ENTERPRISE              ARMED FORCES
            INDEPENDENT  COOPERATIVE   DEFENSE   OFFENSE
            ? [subject's entry illegible in source]

LAYOUT
PATRICIA         5            3           2         0
CYBER            3            2           4         2

PAYOFF                                               TOTAL SCORES
PATRICIA        10            6           0         0    16    78
CYBER            6            6           0         0    12   120



Subjects enter their choices in response to a question mark at the 
end of the third line of each game. The subject’s and computer's 
allocations are then displayed under the heading “LAYOUT.” It 
should be noted that the choices allocated to defense are doubled. 
“PAYOFF” for both the player and the computer is displayed to-
gether with summary scores. 
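
The payoff rules can be illustrated with a short sketch that reproduces the payoff row of the example game. It rests on several assumptions that the instructions do not spell out: the cooperative payoff is taken as three times the smaller of the two cooperative investments, offense units are taken to pay off only when they exceed twice the opponent's defense units, and losses charged to the attacked player are not modeled. The function name and argument names are hypothetical; the multipliers 2, 3, 2, and 5 come from the instructions.

```python
def play_payoffs(mine, theirs, independent=2, cooperative=3, defense_bonus=2, offense=5):
    """Payoffs for one play, given units entered by each side.

    mine and theirs are (independent, cooperative, defense, offense) units
    as entered, i.e., before defense units are doubled for display.
    Returns the four category payoffs followed by their sum.
    """
    ind, coop, _, off = mine
    _, their_coop, their_def, _ = theirs
    shared = min(coop, their_coop)                        # assumed meaning of "joint investment"
    unblocked = max(0, off - defense_bonus * their_def)   # assumed blocking rule
    parts = (independent * ind, cooperative * shared, 0, offense * unblocked)
    return parts + (sum(parts),)


# Entries inferred from the LAYOUT rows, where defense is shown doubled.
patricia = (5, 3, 1, 0)
cyber = (3, 2, 2, 2)
patricia_payoff = play_payoffs(patricia, cyber)
cyber_payoff = play_payoffs(cyber, patricia)
```

Under these assumptions the sketch yields 16 for the subject and 12 for the computer, matching the payoff totals displayed in the example.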


The Imago Algorithm 


The conversion of ECHO from a strategy-oriented game into
a personality test was accomplished by the introduction of the
imago algorithm, a modification of the “tit-for-tat” algorithm
(Balcar, 1967; Rapoport, 1965). The rationale of the imago algorithm
is similar to the reasoning behind projective tests: if there
is nothing specific in the stimulus configuration to determine a
particular type of response, then the response must be solely
determined by some aspects of the inner organization of an individual.
The imago algorithm goes one step further: if a [simulated]
interpersonal exchange is determined solely by decisions of a
single individual, then these decisions reflect some aspects of the
individual’s personality. The imago algorithm is thus designed along
principles similar to those incorporated into the electronic feedback
circuitry, designed to amplify signals. Amplified are those aspects
of the inner organization of an individual which determine the
game-related decisions.

The full course of the imago algorithm, as programmed into ECHO,
consists of four distinct stages. During the first stage, the subject
is presented with a neutral set of stimuli. This presentation is followed
by the first play period, in which the computer’s strategy
mirrors the previous responses of the subject. The first stage is
then repeated to enable the subject to reevaluate his strategy and
to prevent him from discovering the principles comprising the imago
algorithm. The fourth and last stage of the test is again characterized
by the mimic feedback (i.e., as in the first play period)
inherent in the imago algorithm.


Parameters of the Play 


There are eight parameters of the play, which can be adjusted 
by the assignment of different values to variables on lines 140 
through 210 of Version 2.0 of the ECHO program. These eight 
parameters and their corresponding descriptions are as follows: 

The number of units to be distributed is NUNITS. This pa- 
rameter is presently preset to nine. Assignment of different values 
to this parameter would probably have only a minor effect on the 
game. 

The length of the delay period (IDELAY) is preset at three. This
parameter disguises the reciprocity of the game. Low values associated
with this parameter maximize the amplifying effect of the
imago algorithm, whereas higher values assure credibility of the
simulated interaction.

The parameter determining the length of the game (IHALF) is
preset to eight. The whole test in its present form therefore consists
of 16 separate plays.

The combination of IDELAY and IHALF parameters determines
the spacing and length of the test's four component substages. As
presently programmed, ECHO consists of four play periods: first
neutral stage (plays one through three), first play period (plays
four through eight), second neutral stage (plays nine through
eleven), and second play period (plays 12 through 16).
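
How IDELAY and IHALF jointly lay out the four substages can be sketched as below. The schedule follows the play ranges stated above; the mirroring rule (the computer copies the subject's most recent allocation) and the content of the neutral allocation are assumptions, since the paper says only that the computer's strategy mirrors the subject's previous responses. The function names are hypothetical.

```python
IDELAY = 3  # plays in each neutral stage
IHALF = 8   # plays in each half of the test


def stage(play, idelay=IDELAY, ihalf=IHALF):
    """Return "neutral" or "mirror" for a play number starting at 1."""
    position = (play - 1) % ihalf + 1  # position of the play within its half
    return "neutral" if position <= idelay else "mirror"


def cyber_allocation(play, subject_history, neutral_allocation):
    """Computer's allocation for a play (illustrative mirroring rule).

    During neutral stages a fixed, caller-supplied allocation is used;
    during play periods the subject's most recent allocation is copied.
    """
    if stage(play) == "neutral" or not subject_history:
        return neutral_allocation
    return subject_history[-1]


schedule = [stage(play) for play in range(1, 2 * IHALF + 1)]
```

With the preset values, `schedule` is neutral for plays 1 to 3 and 9 to 11 and mirrored for plays 4 to 8 and 12 to 16.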

The payoff parameters were preset on the basis of pilot experiments
to the values of 2-3-[0,2]-5 for independent enterprise, cooperation,
defense, and offense (ТРЕЕ — ІСКЕ — [IDFF, IDBON]
— IOFF), respectively. These values were carefully balanced to
assure as closely as possible an equal probability for the occurrence
of each type of response.


Test Scores, Norms, and Interpretation. 


Four primary scores are indicative of the intensities of behavior
in the independence, cooperation, defensiveness, and aggression
regions. Their magnitudes are determined by the mean amounts
of money allocated by the subjects to these respective areas. Other
scores are possible, pertaining to the allocation trends and overall
success of the player.

Variable test norms are implemented and updated after each
testing run. This feature provides for an immediate analysis of
each particular game. Interpretation of the game can be suppressed
or used for the discussion of the game with the participant. An
example of an interpretation is shown as follows:

Patricia, your scores for independence, cooperation, defensiveness, and
aggression are 3.50, 1.75, .81, and 2.94. I administered this personality
test to 70 subjects and as compared with them . . . Apropos, do you
like to compare yourself to others? [No]. O.K. but you will never
know. Changed your mind? [Yes]. In terms of T scores (where
mean = 50 and standard deviation = 10) your standing on these four
scales is 52.31, 44.22, 48.70 and 52.77. In plain language, you are:
autonomous, cooperative, protective, and combative. Your gains were
average. Thank you for playing with me. Take care. Bye.
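
The T scores in the interpretation follow the usual conversion T = 50 + 10(x − mean)/SD. A minimal sketch of norms that are updated after each testing run is given below; the class name `VariableNorms` is hypothetical, and the use of the sample (n − 1) standard deviation is an assumption, since the paper does not say how the norms are computed.

```python
import math


class VariableNorms:
    """Running norms for one scale, updated after each testing run."""

    def __init__(self):
        self.n = 0
        self.total = 0.0
        self.total_sq = 0.0

    def update(self, score):
        """Add one subject's scale score to the norm group."""
        self.n += 1
        self.total += score
        self.total_sq += score * score

    def t_score(self, score):
        """T score (mean 50, standard deviation 10) relative to current norms."""
        mean = self.total / self.n
        variance = (self.total_sq - self.n * mean * mean) / (self.n - 1)
        return 50 + 10 * (score - mean) / math.sqrt(variance)


# Hypothetical norm group of three earlier subjects on one scale.
norms = VariableNorms()
for earlier_score in (1.0, 2.0, 3.0):
    norms.update(earlier_score)
t = norms.t_score(3.0)
```

At least two scores with nonzero spread must be in the norm group before `t_score` is called.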


Test Utility and Applicability 


ECHO was developed as an attempt to improve the classic paper
and pencil test format by means of a testing strategy which
measures actual behavior of both theoretical interest and practical
significance. Pilot experiments showed moderate correlations be- 
tween test scores and several factorial scales of Cattell's Sixteen 
Personality Factors Questionnaire (Cattell, Eber, and Tatsuoka, 
1970), as well as the scales of Rosenzweig’s Picture-Frustration 
Study (Rosenzweig, Fleming, and Clarke, 1947). These preliminary 
findings should be followed by test development and validation 
prior to making any conclusions in practical testing situations. 
Contingent upon standard development and validation procedures, 
ECHO has the potential of becoming a valuable addition to the 
behavioral researcher’s testing inventory. 


Availability 
For extended documentation write to Research and Development
Center, 13 Pattee Hall, University of Minnesota, Minneapolis,
Minnesota 55455. Please specify OP-30, 1974.

A PROGRAM FOR COMPUTING RANK CORRELATIONS
FROM ORDERED CONTINGENCY TABLES 


LEWIS R. AIKEN 
Sacred Heart College 


Formulas and a FORTRAN program for computing Kendall's 
tau as well as a generalized Spearman rho coefficient from ordered 
contingency tables are described. Relative advantages and disad- 
vantages of tau and rho as measures of association are considered. 
The program can be used for analyzing ranked data from many 
subjects on two variables or for comparing the responses of one 
subject with the correct responses to an ordered multicategory 


item. 


The problem of what to do with tied or identical ranks when
computing rank-order correlations is usually not dealt with satisfactorily
in psychological and educational statistics books. Glass
and Stanley (1970) discussed the problem of tied ranks in more
detail than the authors of many other books, but they referred
the reader to Kendall (1970) for other developments. Kendall
(1970) generalized the problem of tied ranks to ordered R × C
contingency tables, and provided two tau coefficients for this situation:


$$\tau_b = 2S/\sqrt{TU}, \text{ and} \qquad (1)$$

$$\tau_c = 2mS/[N^2(m - 1)]. \qquad (2)$$

In these formulas,
$m = (R + C - |R - C|)/2$; $f_{ij}$ is the frequency (number of observations)
in the ijth cell of the contingency table.
Generalizing Spearman’s rank-order correlation to ordered contingency
tables, the following formula was derived by the writer:

$$\rho = (D_{\max} + D_{\min} - 2D)/(D_{\max} - D_{\min}), \qquad (3)$$

where

$$D = \sum_{i}\sum_{j} f_{ij}(i - j)^2,$$

$D_{\max}$ is the maximum possible value of $D$ and $D_{\min}$ the minimum
possible value of $D$ for the given table of data. $D_{\max}$ and $D_{\min}$
can be quickly computed by a simple procedure devised by the
writer. Unlike $\tau_b$ and $\tau_c$, regardless of the marginal totals of the
contingency table, the range of $\rho$ is always −1.00 to +1.00.

When R = C = 2, formula 3 reduces to:

$$\rho = \frac{N - |f_{11} - f_{22}| + |f_{12} - f_{21}| - 2(f_{12} + f_{21})}{N - |f_{11} - f_{22}| - |f_{12} - f_{21}|}. \qquad (4)$$

And in the special case where the row marginal totals are identical
to the column marginal totals, formula 4 further simplifies to:

$$\rho = 1 - 2(f_{12} + f_{21})/(N - |f_{11} - f_{22}|). \qquad (5)$$

The values of $\tau_b$, $\tau_c$, and $\rho$ are quite similar when the marginal
frequency splits are not too extreme. Otherwise, the tau coefficients
have an advantage in that their probability distribution is known
(see Kendall, 1970). On the other hand, $\rho$ has an advantage over $\tau$
in that the range of the former coefficient is always −1.00 to +1.00.
In the computation of $\rho$, greater weight is also given to larger
differences between $i$ and $j$.
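
The coefficients discussed above can be sketched for a small table as follows. This is an illustration, not the author's FORTRAN program. The definitions of S, T, and U are not legible in the source, so the standard Kendall forms are assumed here: S is the number of concordant minus discordant pairs, T is N² minus the sum of squared row totals, and U is N² minus the sum of squared column totals. The rho function covers only the 2 × 2 case as formula 4 is read here, because the general procedure for D_max and D_min is not given in the paper. Function names are hypothetical.

```python
import math


def s_statistic(f):
    """Concordant minus discordant pairs in an ordered contingency table."""
    rows, cols = len(f), len(f[0])
    s = 0
    for i in range(rows):
        for j in range(cols):
            for k in range(i + 1, rows):
                for l in range(cols):
                    if l > j:
                        s += f[i][j] * f[k][l]
                    elif l < j:
                        s -= f[i][j] * f[k][l]
    return s


def kendall_taus(f):
    """Return (tau_b, tau_c) under the assumed definitions of S, T, and U."""
    n = sum(sum(row) for row in f)
    m = min(len(f), len(f[0]))  # equals (R + C - |R - C|) / 2
    s = s_statistic(f)
    t = n * n - sum(sum(row) ** 2 for row in f)
    u = n * n - sum(sum(col) ** 2 for col in zip(*f))
    return 2 * s / math.sqrt(t * u), 2 * m * s / (n * n * (m - 1))


def rho_2x2(f):
    """Generalized rho for a 2 x 2 table, following formula 4 as read here."""
    (f11, f12), (f21, f22) = f
    n = f11 + f12 + f21 + f22
    d_max = n - abs(f11 - f22)
    d_min = abs(f12 - f21)
    d = f12 + f21
    return (d_max + d_min - 2 * d) / (d_max - d_min)


# Hypothetical 2 x 2 table of frequencies.
table = [[8, 2], [3, 7]]
tau_b, tau_c = kendall_taus(table)
rho = rho_2x2(table)
```

The 2 × 2 formula is undefined when D_max equals D_min, so a table whose margins fix every cell would need separate handling.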

Since it was considered potentially useful to know all three
coefficients for ordered contingency tables, a FORTRAN program
was constructed. The program was written to compute the contingency
tables and rank intercorrelations among 25 or fewer variables,
although more variables can be easily accommodated by
changing the DIMENSION statement. The output of the program
consists of the observed cell frequencies, the marginal totals, and
the values of $D_{\max}$, $D_{\min}$, $D$, $\rho$, $\tau_b$, and $\tau_c$ for each pair of variables.
The data deck is headed by a single parameter card punched according
to format statement 1. The number of variables and the
number of subjects, respectively, are punched in the first two
four-column fields of this card. These two numbers are followed
in successive two-column fields of the same card by the number
of ordered categories in each respective variable. The rank data
of a given subject on all variables are punched in successive columns
of one card according to format statement 2.

¹ Copies of the computer program and directions for its use will be sent on
request by writing to: Lewis R. Aiken, P.O. Box 8884, Guilford College,
Greensboro, N. C. 27410.

In addition to analyzing ranked data obtained from many subjects,
the program can also serve as a means of comparing the
rankings given by a single subject with the correct rankings of a
large group of items. The correct category placements of the items
may be represented by the rows of a contingency table, and the
categories in which the examinee actually places the items by the
columns of the table.
COMPUTER PROGRAM FOR THE SELECTION AND
COMPUTATION OF MEASURES OF RELATIONSHIP 


ROBERT A. SMITH, MARK JAMES, AND WILLIAM B. MICHAEL 
University of Southern California 


Relative to given scale properties of each of two paired variables,
a computer program for the identification and computation of the 
following indices of relationship is provided: phi, Spearman rank 
order, Kendall’s Tau, Pearson's product moment (involving two 
continuous variables), biserial, and point biserial. 


This program provides a procedure to identify, in terms of the
level of measurement, which of the several measures of correlation
would be most nearly appropriate for data analysis. The
identification procedures are based on the recommendations of Glass and
Stanley (1970, pp. 156-181) and Fox (1969, p. 232). Once the
identification procedure is completed, the program then computes
the selected statistics employing the IBM Scientific Subroutine
Package GH20-0205-4 (IBM, 1970).

Statistics which will be computed for each criterion are indicated
in Table 1.


Procedure 
The program deck must have a control card with these
instructions to implement the procedure:
Column 1. If the desired statistic is known it can be ordered by
the specified code (Format I1):
1 = Phi Coefficient
2 = Spearman Rank Order Correlation
3 = Kendall’s Tau
4 = Product Moment Correlation
5 = Biserial Correlation
6 = Point-Biserial Correlation
0 or blank = the routine will use the criteria specified in
Columns 2 through 4 to determine the appropriate
routine to use
Columns 2-4. These columns specify level of data measurement. 
It is possible to specify multiple levels if desired. If the routine 
cannot ascertain which routine will be used, it will print out a 
table indicating the routines available and the criteria used 
for each routine. If more than one routine meets the specified 
criteria then both will be run (Format 3A1):

A = Ordinal data
B = Interval data
C = Both variables dichotomous
D = One variable is a false dichotomy
E = One variable is a true dichotomy
F = Both variables continuous
G = One variable continuous
H = Both variables have multiple categories

Column 5. For the biserial routines this column identifies which
variable is to be treated as a dichotomy. A code of 1 will
identify the first variable. The program will default to the
second variable with this column blank (Format I1).

Columns 10-19. Identifies the dichotomizing value for the first
variable if needed. If none is indicated, the program will use
the median value (Format F10.0).

Columns 20-29. Identifies the dichotomizing value for the second
variable if needed. If none is indicated, the program will use
the median value (Format F10.0).
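
The control-card layout described above can be sketched as a small fixed-column parser. This is only an illustration in Python, not the program's FORTRAN input routine; the names `ControlCard`, `columns`, and `parse_control_card` are hypothetical, and blank fields are assumed to mean "use the default" as the text describes.

```python
from dataclasses import dataclass
from typing import Optional

# Column 1 codes as listed in the text.
STATISTIC_CODES = {
    1: "phi",
    2: "Spearman rank order",
    3: "Kendall's tau",
    4: "product moment",
    5: "biserial",
    6: "point-biserial",
}

@dataclass
class ControlCard:
    statistic: Optional[str]     # None: choose from the level-of-measurement codes
    level_codes: str             # up to three letters from columns 2-4
    dichotomy_variable: int      # 1 or 2; the text gives 2 as the default
    cut_first: Optional[float]   # None: use the median
    cut_second: Optional[float]  # None: use the median

def columns(card: str, first: int, last: int) -> str:
    """Return the 1-based, inclusive column range of a card image."""
    return card[first - 1:last]

def parse_control_card(card: str) -> ControlCard:
    card = card.ljust(80)  # pad short lines to a full 80-column card
    code = columns(card, 1, 1).strip()
    statistic = STATISTIC_CODES[int(code)] if code not in ("", "0") else None
    levels = columns(card, 2, 4).replace(" ", "")
    which = 1 if columns(card, 5, 5) == "1" else 2
    cut1 = columns(card, 10, 19).strip()
    cut2 = columns(card, 20, 29).strip()
    return ControlCard(statistic, levels, which,
                       float(cut1) if cut1 else None,
                       float(cut2) if cut2 else None)
```

A card beginning with `3`, for instance, requests Kendall's tau directly and leaves the remaining fields at their defaults.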

Data cards follow the identification card. All data are read
in for two variables (Format 2F10.0).
INTERVAL SCALING USING PAIR COMPARISONS OR
PAIR COMPARISON TREATMENT OF COMPLETE 
RANKS UNDER CASE III ASSUMPTIONS 


ROBERT J. WHERRY, SR. AND CHESTER A. SCHRIESHEIM
The Ohio State University 


Two of the better approaches to the scaling of stimuli are the
methods of pair comparison and pair comparison treatment of
complete ranks under the assumptions of Thurstone’s Case III.
This paper outlines a computer program with four data input
options which scales up to 40 stimuli using these methods. The
program will also compute an absolute scale zero point if the
user employs the Horst method of balanced values (acceptance
or rejection of pairs) in addition to the pair comparison procedure.
Output of the program is detailed and includes, in addition to scale
values and standard deviations, several matrices which allow Case
IV or V scaling by hand with minimal effort.


Two of the better approaches to the scaling of stimuli are the
methods of pair comparison and pair comparison treatment of
complete ranks. These procedures are most often applied under the
assumptions of Thurstone’s Case V because of the additional labor
involved in scaling under Case III. Also, while the Horst method
of balanced values (acceptance or rejection of pairs) yields an
absolute scale zero point (Guilford, 1936), this method is usually
not employed because of the computational effort involved. Since
Case III scaling is more accurate than that for Case V (it does
not assume equal discriminal dispersions, but instead incorporates
variability into the computation of scale values) and because the
Horst method produces scales which approximate ratio
measurement, a computer program which allows the application of these
methods is of considerable value to the scale-builder.

The THURSCALE program is designed to scale up to 40 stimuli 
using one of four data input options: (1) complete unordered pairs,
(2) complete or incomplete preordered pairs, (3) ordered pairs with
Horst balanced values, and (4) complete rankings of stimuli. Only
one control card is required for operation, in addition to the data
to be scaled.

The program is written in FORTRAN IV for the IBM 360/75
and 370/165, but should be compatible with nearly any FORTRAN
compiler. It includes extensive comment cards which provide card
input specifications and which label and describe each major sub- 
routine. Output includes: (1) the matrix of input proportions, (2) 
the matrix of ordered proportions, (3) the matrix of ordered 
z-scores (useful if Case IV or V scaling by hand is desired), (4) 
Case III scale values for each stimulus, (5) standard deviations for 
each stimulus, and (6) scale distances between adjacent stimuli. 
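
The remark that the z-score matrix permits Case V scaling by hand can be illustrated with a short Python sketch. This is not part of THURSCALE and does not implement its Case III solution; it shows only the simpler Case V computation, under the assumptions that `p[i][j]` is the proportion of judgments preferring stimulus j over stimulus i, that no proportion is exactly 0 or 1, and that the lowest stimulus is arbitrarily placed at zero. The function name is hypothetical.

```python
from statistics import NormalDist

def case_v_scale(p):
    """Thurstone Case V scale values from a square matrix of proportions.

    p[i][j] is the proportion of judgments preferring stimulus j over
    stimulus i (an assumed convention); diagonal entries are ignored.
    """
    n = len(p)
    unit_normal = NormalDist()
    # Convert proportions to unit-normal deviates (the z-score matrix).
    z = [[0.0 if i == j else unit_normal.inv_cdf(p[i][j]) for j in range(n)]
         for i in range(n)]
    # Under Case V the scale value of stimulus j is its column mean.
    column_means = [sum(z[i][j] for i in range(n)) / n for j in range(n)]
    # Shift so that the lowest stimulus sits at zero (arbitrary origin).
    lowest = min(column_means)
    return [m - lowest for m in column_means]
```

Because Case V assumes equal discriminal dispersions, these values will generally differ from the Case III values the program reports when the stimulus standard deviations are unequal.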

A listing and writeup of THURSCALE along with sample input 
and output can be obtained by writing to Chester A. Schriesheim, 
Department of Psychology, The Ohio State University, 404-C West 
17th Avenue, Columbus, Ohio 43210. Punched deck copies are also 
available at cost. 
A SUBROUTINE FOR COMPUTING ITEM EFFICIENCY
AND ASSOCIATED PROBABILITIES 


RICHARD J. HOFMANN 


Miami University 


A subroutine for use in any item analysis program is provided.
This routine computes the efficiency of a test item, the exact 
probability of obtaining this efficiency, and the chance probability 
of obtaining a greater efficiency for an item, given its difficulty 
level, discrimination index, and sample size. 


The efficiency index is a value ranging from zero to unity. It is
a measure of the functional discrimination of a test item given 
its observed difficulty level and discrimination index. The greater 
the magnitude of the index the more efficient are the observed dis- 
criminations. It is computed as a function of both item difficulty 
and item discrimination (Hofmann, in press). 

This subroutine is intended to be used in conjunction with any 
test analysis program that generates difficulty and discrimination 
indices based upon some upper-lower group split. Specifically, it 
will calculate item efficiency, the exact probability of the observed 
efficiency, and the chance probability of obtaining a greater effi- 
ciency index, given a particular sample size and the associated 
discrimination and difficulty indices. 

In theory the subroutine is a special application of the Fisher
exact test (as discussed by Hofmann, in press). The actual
computations, however, are made through the use of the gamma
function (Γ) and natural base e logarithms (ln). Assume a two-way
table of the following form.
Group 1 (Upper)      A        B        A + B
Group 2 (Lower)      C        D        C + D
                   A + C    B + D    A + B + C + D

The exact probability of obtaining this particular table given the
restriction of fixed marginals is determined as p, where

\[ x = \ln\Gamma(A+B+1) + \ln\Gamma(C+D+1) + \ln\Gamma(A+C+1) + \ln\Gamma(B+D+1) - \ln\Gamma(A+B+C+D+1); \tag{1} \]

\[ y = \ln\Gamma(A+1) + \ln\Gamma(B+1) + \ln\Gamma(C+1) + \ln\Gamma(D+1); \tag{2} \]

\[ p = e^{x-y}. \]
Following Hofmann (in press) the exact probabilities of
successively less independent tables are computed and summed, under
the restriction of fixed marginals, to yield the probability of
obtaining a greater efficiency index given the item difficulty.
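
The log-gamma computation of the exact table probability can be sketched in Python as follows. This is an illustration rather than the FORTRAN subroutine itself: the function names are hypothetical, and the way the "less independent" tables are enumerated (moving one case at a time from cells B and C into cells A and D while holding the marginals fixed) is an assumption about the summation, which is specified in Hofmann (in press) and not in this note.

```python
from math import exp, lgamma

def table_probability(a, b, c, d):
    """Exact probability of a 2 x 2 table with fixed marginals.

    Follows the log-gamma form: p = exp(x - y), where x collects the
    marginal terms and y the cell terms.
    """
    x = (lgamma(a + b + 1) + lgamma(c + d + 1)
         + lgamma(a + c + 1) + lgamma(b + d + 1)
         - lgamma(a + b + c + d + 1))
    y = lgamma(a + 1) + lgamma(b + 1) + lgamma(c + 1) + lgamma(d + 1)
    return exp(x - y)

def probability_of_more_extreme(a, b, c, d):
    """Sum the probabilities of tables more extreme than the observed one.

    Assumed enumeration: shift one case at a time from B and C into
    A and D, which keeps every marginal total unchanged.
    """
    total = 0.0
    while b > 0 and c > 0:
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
        total += table_probability(a, b, c, d)
    return total
```

Working in logarithms keeps the factorial terms from overflowing; the sample-size ceiling is then set by the argument range of the log-gamma routine, not by the factorials themselves.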


Language and Capacity 
This subroutine, which is written in double precision FORTRAN
IV, was developed on the IBM System/360 with a G-level
compiler. It requires 2066 bytes of storage. Of some concern is the
total sample size, the maximum value of which is fixed by the
double precision log gamma function argument range. On the IBM
System/360 this maximum argument is fixed at 4.2937 × 10^73.
However, this number will vary according to the computer
used.
The computation algorithm is very efficient, as it required slightly
less than 3.78 seconds to compute the efficiency indices and
associated probabilities for all possible 148 combinations of difficulty
and positive discrimination indices for a sample of 26 as well as
to default and correct itself for all possible 148 combinations of
difficulty and negative discrimination indices.


Parameters 


The subroutine is called initially to establish a table of gamma
values. All subsequent callings for each item must have the
difficulty level, discrimination index, and sample size for the item. The
subroutine returns unchanged the item difficulty, discrimination,
and sample size and, in addition, the item efficiency and the exact and
cumulative probabilities of this efficiency index.


Availability 


Copies of this manuscript, a listing of the subroutine, and a
special small illustrative program are available. The illustrative
program generates sample data and output. These materials may
be obtained by writing to Richard J. Hofmann, Department of
Educational Psychology, Miami University, Oxford, Ohio 45056.
A COMPUTER PROGRAM TO GENERATE SAMPLE
CORRELATION AND COVARIANCE MATRICES


RICHARD G. MONTANELLI, JR. 
University of Illinois at Urbana-Champaign 


Given a population-covariance matrix, this program can generate
any number of sample-covariance (or correlation) matrices, based 
on any sample sizes. If no population matrix is input, the program 
generates random correlation matrices. 


RESEARCHERS interested in conducting sampling studies with
multivariate techniques like factor analysis need an efficient computer
program to generate sample-correlation and/or covariance matrices,
based on any given number of observations (N). Given a
population-correlation or covariance matrix, this program can generate
sample-correlation (or covariance, if the input was covariances)
matrices. It can also generate random sample-correlation matrices,
based on normally distributed random numbers. Uses for such
matrices in factor analysis have been reported by Humphreys and
Montanelli (1975) and Montanelli (1974).


Method 


The program is based on the method discussed by Odell and
Feiveson (1966). This method has the advantage that only
n(n + 1)/2 (n = the number of variables (rows) in the correlation
matrix) random numbers need to be generated in most cases (except
when N < n + 30). This method also saves a considerable amount
of work in computing the correlation matrix, especially for large N.
Uniformly distributed pseudorandom numbers are generated by
the multiplicative method, first suggested by Lehmer (1951) and
discussed by Jansson (1966, p. 33). Pseudorandom normal deviates
are provided by transformations given by Box and Muller (1958),
and pseudorandom chi variates are computed by using an
approximation for the inverse function for more than 30 degrees of freedom
(Abramowitz and Stegun, 1966, p. 941) or the method of Box and
Muller (1958) otherwise.
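
The first two building blocks named above, a multiplicative (Lehmer) generator feeding the Box and Muller transformation, can be sketched in Python as follows. This is an illustration only: the multiplier 16807 and modulus 2^31 - 1 are assumed for the example and are not taken from the program, and the function names are hypothetical. The Odell and Feiveson step and the chi variates are not shown.

```python
import math

def lehmer(seed, multiplier=16807, modulus=2**31 - 1):
    """Multiplicative congruential generator yielding uniforms in (0, 1).

    The constants are illustrative assumptions; the seed must lie
    between 1 and modulus - 1.
    """
    x = seed
    while True:
        x = (multiplier * x) % modulus
        yield x / modulus

def box_muller(uniforms):
    """Turn pairs of uniform deviates into pairs of unit-normal deviates."""
    while True:
        u1 = next(uniforms)
        u2 = next(uniforms)
        radius = math.sqrt(-2.0 * math.log(u1))
        angle = 2.0 * math.pi * u2
        yield radius * math.cos(angle)
        yield radius * math.sin(angle)
```

Calling `box_muller(lehmer(12345))` gives an endless stream of approximately unit-normal deviates from which the required n(n + 1)/2 values could be drawn.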


Limitations 


The program is written in FORTRAN IV for the IBM 360/75.
The work is performed by a subroutine COVGEN which is
dynamically allocated by specifying two constants in the small main
program. Thus, the program, which is totally independent of N,
can be trivially altered for any n (up to the amount of memory
the computer has available). The program can generate any number
of samples with various Ns from one or more population-correlation
matrices and/or any number of random-correlation matrices in
one run.


Availability 


A copy of this article, the program, and additional documentation
are available from the author at the Department of Computer
Science, University of Illinois at Urbana-Champaign, Urbana,
Illinois 61801.
BOOK REVIEWS


MAX D. ENGELHART, Editor 
LEWIS R. AIKEN, JR., Assistant Editor


With this issue, Dr. Dennis M. Roberts of the Department of
Educational Psychology of the Pennsylvania State University is
replaced by Dr. Lewis R. Aiken, Jr., of the Department of
Psychology of Guilford College as Assistant Book Review Editor.

The Book Review section of EDUCATIONAL AND PSYCHOLOGICAL 
MEASUREMENT is not a book approval section. Readers may expect 
adverse comments. We discourage invidious comparisons with com- 
peting books. We prefer to publish reviews on educational and 
psychological measurement, statistical and other methods applicable 
to educational and psychological research, and on applications of 
computers. 

In general, reviewers are assigned books to review by the review 
editors with attention to their abilities and interests. Persons wish- 
ing to volunteer a review should query the book review editor to 
obtain permission in order to avoid undue duplication. 


Michael J. Apter and George Westby (Eds.). The Computer in
Psychology. New York and London: Wiley, 1973. Pp. xvi + 309.
$14.95.


Oriented toward the needs and interests of the psychologist, this
short volume is organized into two parts: the first consisting of
five chapters concerned with the basic principles and techniques
of digital computers; and the second, of five chapters dealing with
applications of computers to five areas of substantive interest. In
addition to the contributions of one of the editors (Apter) to the
writing of one entire chapter and of co-authoring another, five
other psychologists were involved in the completion of the eight
other chapters. That at the time of its preparation all authors were
members of the Department of Psychology at the University
College Cardiff in the University of Wales may account for the
essential continuity of the text and for the relatively nonfragmented
treatment of the topics covered. The resulting uniform level of
reading ease from this cooperative venture would suggest that this
well-organized book with its extensive bibliography would be an
appropriate text for the upper division student or first-year
graduate student in psychology, although the first five chapters would
be of interest to most students in the behavioral and social sciences.

In Part I the first two chapters are concerned, respectively,
with an introduction to computers and with an introduction to
programming. Each of these two chapters (each co-authored by
John A. Wilson and Geoffrey Barrett) is easy to understand,
reasonably current, and quite practical in its approach. At a relatively
elementary level these first two chapters cover much of the same
material about use of computers for data processing as can be
found in other general texts as indicated by the annotated
references at the end of the second chapter (one of which is, however,
later than 1969). It is not until reaching the third chapter, which
is concerned with the use of computer language in experimental
control, that specific psychological concerns are emphasized. In
fact, this third chapter authored by Geoffrey Barrett affords the
experimental psychologist helpful background information about
technology for on-line control of experiments. Extending the
emphasis on computer methodology in experimental psychology,
Chapter 4 by Godfrey Harrison entitled “The Computer in Psychology
Experiments” includes material on calculation, collation, generation
of displays, synchronization of presentation of temporally related
stimuli, and production of runs of drawings of appropriate stimuli.
Computer applications to the study of serial learning are also
discussed along with use of computers in closed loop studies.
Practical concerns are cited, and a helpful example is detailed. Finally,
in Chapter 5 by Michael J. Apter, the relevance of the computer to
modelling of behavior from the standpoint of the use of structural
and functional models is considered. Particular attention is given to
the role of the computer as an information processing system
analogous to that found in activities of the human nervous system.
Interrelationships among theories, models, organisms, and
computers are described and illustrated, and the technique of computer
simulation is critiqued.

Although applications of computer technology to
experimentation and modelling of behavior are set forth in Part I of the book,
five relatively specific areas of application are developed in Part II.
In Chapter 6 John A. Wilson details the use of the computer in
experiments on the psychology of perception in both visual and
auditory domains, and in Chapter 7 Godfrey Harrison describes
how the computer has been employed in the study of the psychology
of language. Numerous illustrative examples are cited in both
chapters. A relatively short discussion of the computer in the study
of animal behavior appears in Chapter 8, “The Computer in
Comparative Psychology,” by Stuart J. Dimond. In Chapter 9, James O.
Robinson explains how the computer can be used in clinical psy- 
chology. In addition to storing clinical information, the computer 
affords a means for the automation of psychological testing and of 
interviews as well as provides a vehicle for test interpretation, 
diagnosis, and potentially even therapy. How the computer may be 
employed in education and training is the subject of the tenth and 
concluding chapter by Michael J. Apter and Geoffrey Barrett. As 
the reader might surmise, the origin and development of computer- 
assisted instruction (CAI) as well as its techniques are a central 
focus of this chapter. In the estimation of the reviewers, this chapter 
is one of the best they have seen regarding CAI, and the critique 
of CAI is an exhaustive and penetrating one. 

It would appear that the contributors to this volume have suc- 
ceeded admirably in introducing psychology students as well as 
professional psychologists who have not had much experience with 
digital computers not only to their key principles, their language, 
and their actual and potential uses as tools of psychological 
inquiry but also to their applications to a number of specific sub- 
stantive areas. In deemphasizing data processing and statistical 
analysis per se and in treating the computer as a heuristic tool for 
both psychological theorizing and empirical investigation, the 
writers have made an important and perhaps unique contribution 
not only to the teaching of psychology but also to updating of 
knowledge of computers by research psychologists who now may be 
encouraged to apply what they have learned from study of this 
volume to their own areas of investigative interest. 


WILLIAM B. MICHAEL 

University of Southern California 

JOAN J. MICHAEL

California State University, Long Beach 


J. W. Atkinson and J. O. Raynor (Eds.). Motivation and Achieve- 
ment. Washington, D.C.: V. H. Winston & Sons, (distributed by 
Halsted Press of John Wiley & Sons) 1974. Pp. xi + 479. $19.95. 


There has always been a fascination with the possibility that
achievement is not just a function of competence or good fortune,
that somehow the willingness to pursue a goal is likewise important.
This fascination was given concrete form by David McClelland,
John Atkinson, and others in their early attempts to assess an
achievement motive, using fantasy responses to pictorial stimuli.
And perhaps there are no two names which are more closely
associated with the work on achievement motivation than these.
Interestingly enough, however, following an initial shared interest in
motivational assessment, their work has, in fact, subsequently
proceeded along definitely distinguishable paths. The McClelland
path has been characterized by a series of fascinating construct
validity studies which culminated in a most ambitious consideration 
of the origin and creation of achieving persons who in turn create 
achieving institutions and societies. The Atkinson path is char- 
acterized by a concern for a construction of a theoretical model— 
but a theoretical model which emphasizes the interaction of persons 
with situations to create achievement. The research reported and 
reviewed in this volume is part of the Atkinson path. Indeed, the 
book is primarily composed of papers by Atkinson and his co- 
workers—many of them previously published elsewhere. As such, 
it is a handy compendium of the Atkinson path in its current state.
And the current state of affairs is, or at least should be, of interest
to all those concerned with the nature and assessment of
achievement.
The program of research reflected in this book is impressive. Of
particular interest are the new directions this research has taken
since the publication of an earlier summary volume (Atkinson and
Feather, 1966). These new directions include a closer look at sex
differences in achievement, a clear recognition of the importance of
long-term goals and immediate performance, and a viable concern
with application. Last, but certainly not least, this volume reflects
further specification of the model which has, incidentally, also
culminated in the development of a computer simulation routine.
Impressive as the work reflected in this volume may be, it never- 
theless prompts certain questions and criticisms. First, I continue 
to wonder about the applicability of this theory of achievement to 
diverse social and cultural groups (see Maehr and Sjogren, 1971; 
Maehr, in press). The definition and measurement of the achieve- 
ment personality is most obviously appropriate when talking about 
the white middle-class male of the Western world. Atkinson and 
his colleagues have, of course, recognized some of the problems here 
and have suggested approaches to solving them. Thus, Matina 
Horner’s work on sex differences in achievement (see Chapters 6 
and 13) does provide at least one explanation for the previous 
failure of the Atkinson model to apply equally to males and females. 
Of special importance in ultimately understanding the actualization 
of achievement motivation in different cultures may be the new 
emphasis on the role of the perceived instrumentality of a task (see 
Chapters 7 through 10). But recognizing such improvements over 
the past, one may still question whether the thematic apperception 
and test anxiety measures commonly employed in this research are 
so thoroughly embedded in and limited to one culture as to be
limited in their application to various sociocultural groups.
Interestingly enough, this volume includes scarcely a reference to the
problems of cultural diversity in motivational orientation. 
Second, questions about measurement are inevitable, especially
since this program of research has been repeatedly subjected to
considerable criticism in this regard (see, for example, Entwisle,
1972). This volume contains an interesting defense of the
test-retest reliability problem (see p. 8ff.) in the use of thematic
measures of achievement motivation and the addition of new data
improves the case for validity. However, Atkinson and his
colleagues have still not provided the practitioner with anything like
a workable device or scheme for assessing achievement motivation.
Be this as it may, what will be of greatest interest to measurement 
specialists is not the work on motivational assessment per se. 
Rather, it is the criticism of ability and achievement testing from a 
motivational perspective that should prove especially provocative. 
The thrust of the argument here is that the results of achievement 
and ability tests depend rather basically on the motivational orien- 
tations of the person tested. Indeed, Atkinson suggests that the 
results on these tests can justifiably be given a motivational as 
well as, or instead of, the traditional aptitudinal interpretation.
Motivational patterns that are elicited in ability testing situations
may be quite different from those elicited in the course of
achievement in the "real world." Although this possibility has often been
recognized, satisfactory theorizing is seldom to be found. Especially
in his ACT research institute paper (Chapter 20), Atkinson has
proposed a coherent and heuristic theoretical interpretation that
challenges traditional interpretation of ability assessment.

Finally, one may note that as the volume is a reiteration,
expansion, and clarification of earlier themes, it contains no expanded
reference to or perspective on certain recent developments in
achievement theory. Thus, for example, Weiner's recent work on
a reinterpretation of achievement motivation in attribution theory
terms is virtually ignored. Although that is predictable, given the
intentions of the book in playing out the implications of a
previously developed system, it is still disappointing, and for
several reasons. Attributional analyses of achievement
behavior are very much a part of the achievement theory scene at
the moment. Possibly the analysis of achievement motive in terms
of predilections to attribute causes variously to one’s ability or
effort or to situational factors such as luck or task difficulty could
provide the framework for more convenient assessment of
achievement orientations. Conveniently administered and readily scorable
measures of achievement attribution have, of course, been developed.
A wedding of attribution and achievement theory might thus
eventuate in a welcome replacement of fantasy measures of
achievement orientation. Then too, the attributional analysis has provided
an innovative perspective in studying the developmental pattern of
achievement motivation, a topic of some concern to educators but
ignored in this volume and seldom effectively treated anywhere.

All in all, this is a book primarily for scholars concerned with
achievement theory and related assessment problems. It provides
a convenient updating of an important program of research in the
area. It is possible that the volume could be effectively employed as
an adjunct text or as a source of readings for upper-level courses
in such diverse areas as educational/psychological measurement,
counseling and career development, and motivational theory. Having
some experience in using it for a course in the last area, I have
considerable confidence in its applicability in this respect.
Clinton I. Chase. Measurement for Educational Evaluation.
Reading, Mass.: Addison-Wesley, 1974. Pp. viii + 312. $9.95.


This excellent book should promote student acquisition of skills, 
knowledge, attitudes and interest in measurements. These goals 
include efficient writing of classroom tests and the use of appropriate 
standardized tests useful in the evaluation of achievement and in 
obtaining data needed for placement and guidance. Instead of
detailed description of specific tests there is emphasis on the
characteristics of various types of tests, their applications and their
limitations. Such concepts as behavioral objectives, cultural bias,
criterion versus norm referenced testing, kinds of achievement and
ability scores are simply, but adequately, explained. Similarly,
lucid explanations are presented with reference to the different types 
of validity and reliability. 

The categories and related behaviors of the Taxonomy of
Educational Objectives produced by Benjamin Bloom and others,
including this reviewer, are effectively summarized. This is followed by
presentation of the two-dimension grid used in classifying behavior
and content objectives without noting that this device was
originated by Ralph W. Tyler (see his Principles of Curriculum and
Instruction).
The writing and analysis of different types of objective items and
essay questions are reasonably well explained, but no attention is
paid to the writing of exercises relevant to quoted materials as 
exemplified in the Progressive Education Eight Year Study, the 
Cooperative Study of General Education, nor in this reviewer's 
Improving Classroom Testing. Too much emphasis is given to the 
evaluation of knowledge of facts and too little to evaluation of 
critical thinking skills. 

The measurement of general ability or intelligence and of special 
aptitudes is effectively explained in Chapters 7 and 8. The impor- 
tant achievement batteries are discussed in Chapter 9. Chapters 10
and 11 describe instruments useful in assessing personality traits
including interests and attitudes. Chapter 12 deals with
characteristics of test administrators which influence test performance and
with such factors as test anxiety, fatigue, and response bias. Chapter
13 discusses testing programs and the interpretation of test results to
parents. The appendices contain brief but lucid explanations of the
computation of means, standard deviations, and coefficients of corre- 
lation. A useful list is given of names and addresses of ten impor- 
tant test publishers. 
MAX D. ENGELHART


Harry Frank. Introduction to Probability and Statistics: Concepts
and Principles. New York, N. Y.: John Wiley & Sons, Inc., 1974. 
Pp. xvi + 431. $12.95. 


(FIRST REVIEW*)


* The first of the two reviews was assigned by the review editor, the second
by the former assistant review editor.
This introductory textbook is intended for a first statistics course
in the biological or social sciences. As such, the contents are fairly 
predictable. The level of mathematics required is minimal. Calculus 
is not used; topics such as summation notation and the binomial 
theorem are reviewed as needed. The treatment is extremely intui- 
tive, and is one of the more successful examples of this approach. 
The style is even, only becoming more difficult toward the end of 
the text. The examples chosen require no special knowledge from 
a subject discipline. 

The first of the 15 chapters deals with the elements of proba- 
bility theory. Topics include events, the sample space, the addition 
theorem, conditional probability, independent events, and the law 
of large numbers. 

Chapter Two treats discrete random variables and their prob- 
ability distributions. The approach taken is to partition the sample 
space into a collection of mutually exclusive, collectively exhaustive 
events. Values of the random variable are then computed or assigned 
for each event. Probabilities are then associated with the values of 
the random variable, rather than with the events themselves. The 
idea of a probability distribution is presented in this context. 

The next chapter discusses elementary finite counting problems. 
Topics include the fundamental principle of counting, permutations, 
combinations, and tree diagrams. A clear and thorough, but intuitive 
derivation of the binomial distribution is given. The connection with 
the binomial theorem is included. 

Chapter Four presents the distinction between a population and a
sample, between a discrete and a continuous random variable, and 
between grouped and ungrouped data. The next chapter is a stan- 
dard treatment of the mode, the median, and the mean. The treat- 
ment covers both grouped and ungrouped data, and both samples 
and populations. The author carefully distinguishes between μ as a
population parameter and μ as an expectation. A short section,
together with an appendix, discusses the algebra of expectations. 

Chapter Six presents the sample variance and population variance. 
The approach follows that of Chapter Five. Chapter Seven shows 
the method for standardizing a random variable. This is then 
illustrated with a binomially distributed random variable. 

The next chapter sets out the basic facts about the normal distri- 
bution. A limit argument is applied to the binomial distribution. 
The histograms used are clear enough, but the final statement of 
the argument on page 151 is misprinted. The method for reading 
the table of cumulative normal probabilities is given, as well as the 
normal approximation to the binomial. 

Chapter Nine discusses sample covariance, the Pearson product 
moment correlation, and the coefficient of determination. The dis-
cussion is limited to the descriptive uses of these statistics.

Chapter Ten treats sampling distributions and the properties of
estimators. For point estimators the author defines the meaning of
an unbiased estimator, a consistent estimator, and an efficient esti-
mator. The property of linearity is not mentioned. The author
states “. . . all unbiased estimators are indeed consistent. . . .” This
is not true, as can be shown with a simple example. Again, the
author tells us that, for fixed sample size, X̄, the sample mean, is
more efficient (less variance) than any other unbiased estimator.
The proof is incomprehensible, and at a minimum fails to prove the 
result. The chapter states the Central Limit Theorem, without proof, 
and shows the confidence interval for the mean. 
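
The reviewer does not spell out the "simple example." One standard counterexample, offered here only as an illustrative sketch and not as the reviewer's own, is the estimator that reports the first observation of the sample and ignores the rest. It is unbiased for the population mean at every sample size, yet its variance never shrinks, so it is not consistent. The simulation below assumes a normal population; the values of mu, sigma, the number of trials, and the function name are all hypothetical choices made for illustration.

```python
import random
import statistics


def compare_estimators(n, trials=2000, mu=10.0, sigma=2.0, seed=0):
    """Simulate two unbiased estimators of mu from samples of size n.

    Returns ((mean, variance), (mean, variance)) across the simulated
    trials: first for the first-observation estimator, then for the
    sample mean.
    """
    rng = random.Random(seed)
    first_obs, sample_means = [], []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        first_obs.append(sample[0])  # uses one observation, ignores the rest
        sample_means.append(statistics.fmean(sample))
    return (
        (statistics.fmean(first_obs), statistics.pvariance(first_obs)),
        (statistics.fmean(sample_means), statistics.pvariance(sample_means)),
    )


for n in (5, 50, 200):
    (m1, v1), (m2, v2) = compare_estimators(n)
    print(f"n={n:3d}  first obs: mean={m1:.2f} var={v1:.2f}   "
          f"sample mean: mean={m2:.2f} var={v2:.3f}")
```

Under these assumptions both estimators average close to mu, but only the variance of the sample mean falls as n grows (roughly as sigma squared over n); the variance of the first-observation estimator stays near sigma squared.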

Chapter Eleven presents hypothesis testing in general. Topics
include the null and alternate hypotheses, level of significance, types
of errors, effect of sample size, and the power of a test. The dis- 
cussion is thorough. In this chapter and later, cases are presented 
where, using reasonable data, the results are statistically significant 
but not scientifically significant. In particular, the sample statistic
calls for rejection of the null hypothesis even though the same
statistic is even less likely under the alternative hypothesis. The author
suggests a solution which will please few. Terming this occurrence
“overpowering,” he tells us to decrease the sample size, even if it
means taking a subsample from data already collected. The reviewer
considers this philosophy unsound.
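
The situation described above, in which a scientifically trivial difference becomes statistically significant once the sample is large enough, can be illustrated numerically. The sketch below is not taken from the book: it assumes a one-sample, two-sided z test with known standard deviation, and a hypothetical observed difference of 0.02 standard deviations that stays the same while n grows. The function name is invented for illustration.

```python
import math


def two_sided_p(effect_in_sd, n):
    """Two-sided p-value of a z test when the observed mean differs from
    the null value by `effect_in_sd` population standard deviations."""
    z = effect_in_sd * math.sqrt(n)           # test statistic
    return math.erfc(abs(z) / math.sqrt(2))   # equals 2 * (1 - Phi(|z|))


# Same tiny effect, increasing sample size.
for n in (100, 10_000, 100_000):
    print(f"n={n:7d}  p={two_sided_p(0.02, n):.3g}")
```

With these assumed numbers the p-value drops below .05 near n = 10,000 although the effect size never changes. Reporting the effect size or an interval estimate alongside the test is one commonly suggested alternative to discarding data.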

Chapter Twelve continues the discussion of hypothesis testing as
it relates to tests about means. The author systematically treats 
most of the combinations of: normal versus non-normal population, 
one mean or two means, variance known or variance not known, and 
small sample or large sample. The nonparametric tests are not 
covered. The introduction of Student's t-distribution as the appro-
priate density function for the quotient of two random variables is
exceptionally well done.

Chapter Thirteen discusses the two most common tests for vari-
ances: the test for a single variance based on the Chi-square and a
test for two variances based on the F-ratio. Chapter Fourteen sets 
out the Chi-square test as applied to goodness-of-fit, and as applied 
to contingency tables. 

The last chapter, Chapter Fifteen, is an introduction to the 
analysis of variance. The discussion is clearer than most in justi-
fying the basis of the method. The reviewer feels some well-chosen
graphs would help if they would identify which variance goes with
which distribution.

The topics the author has chosen to include are well presented,
with a minimum of loose ends. However, topics which have not been
mentioned include regression analysis, rank correlation, partial
correlation, tests for significance of correlation, maximum likeli-
hood estimation, and stratified sampling. This may pose a problem
in choosing a text for a second course if this text is chosen for a
first course. In addition, it may be desirable to augment the text
with additional exercises, or with a workbook, since there are few
student problems requiring algebraic or arithmetic drill.


THOMAS CHURCH 
Governors State University 
Park Forest South, Illinois 


(SECOND REVIEW) 


The book by Harry Frank is one of several that have come out 
recently designed to provide students in the social and biological 
sciences with an introduction to statistical methods. What is unique 
about the text is that it is intended for use in an introductory course 
at the undergraduate level. The prerequisite skills that are required 
to handle the book are spelled out as two years of high school 
algebra and “some mathematical maturity.” 

The book is divided into three parts: probability theory, prop- 
erties of distributions, and statistical inference. In the first part, the 
author presents the various notions of probability, introduces the 
concept of a random variable, and ends the section with a discussion 
of permutations and combinations, and the binomial distribution. 
The second part of the book deals with descriptive statistics, stan- 
dardization, normal distribution and the concept of covariation. 
The notion of expected values and the algebra of expectation are 
introduced very early in this part. The third part is concerned with 
estimation, sampling distribution, hypothesis testing, tests of hy- 
pothesis on means, variances and entire distributions, and ends with 
an introduction to analysis of variance. The book is concluded with 
an appendix which contains discussions of the summation notation, 
the partitioning of sums of squares, and a more formal treatment 
of the algebra of expectations than that given in the main body of 
the text. An attractive feature of the appendix is that it includes 
the solutions to the problems in the text. 

The material in the text is presented in a lock-step fashion, with 
later chapters depending heavily on the concepts developed in 
earlier chapters. In general, the author introduces new concepts
intuitively either in terms of solving a problem, or with the pre-
sentation of a concrete example. The concepts and principles in-
herent in the problem or the example are then abstracted and
formalized mathematically. This approach is particularly useful in 
teaching students who lack mathematical training. 

The author’s treatment of most of the topics is lucid and concise. 
An important feature of the book that distinguishes it from many 
other introductory books is that the author introduces the notions 
of random variables and expected values very early and once having 
introduced these, exploits them to their fullest advantage. This
should instill in the students some appreciation for the nature of
statistics. The discussion of such difficult topics as unbiasedness,
consistency, and efficiency, though non-mathematical in nature, is 
extremely clear. 

The chapter on hypothesis testing, despite being a condensation 
of chapters fourteen and fifteen of Bailey (1971), contains clear 
presentations of the types of error, and power. The author discusses
both simple and composite hypotheses, the distinction between which 
is not usually found in introductory texts. The discussion that 
follows on scientific versus statistical significance is clear and in- 
structive. In concluding the section, the author recommends that 
“. . . the investigator always cast a simple alternative hypoth-
esis. . . .” However, the argument leading to this recommendation
stops just short of being convincing. In later chapters, the author 
explains how composite alternative hypotheses could be interpreted 
as “synthetic” simple alternative hypotheses in connection with 
inferences about the mean in one and two sample problems. This 
more than makes up for the deficiency in the earlier discussion. The 
only fault in this chapter is the author's failure to point out the 
use of confidence intervals in hypothesis testing. 

There are several other flaws that mar the usefulness of the book. 
In the expression for the standardized normal distribution, the 
author repeatedly includes the standard deviation outside the ex- 
ponent. In the application of the central limit theorem to the 
distribution of the sample mean, the author's attempt to show that 
the sample mean is expressible as a sum of random variables is 
rather confusing. The author distinguishes between the algebra of 
expectations and algebra of variances. However, whenever he has 
to obtain the variance of a linear combination of random variables, 
he refers to the algebra of expectations.

The text contains a brief discussion on experimental design, too 
brief to be of any real value. In his discussion on matching, the 
author says, “The experimenter has access to these variables, how-
ever, and can therefore balance their effects by matching rather
than by randomization.” He goes on to say, “. . . for each 19-year
old man in the experimental groups, he would assign a 19-year old
man to the control group.” The literature in the social sciences abounds
in misuses of the principles of matching. The author, by making
these ambiguous statements, does not improve matters any. We feel
that the author would have made the book more useful by including
a more detailed discussion on the fundamentals of experimental
design and their use in social research. The author includes an
interesting and lengthy discussion on the derivation of Student's
t-distribution. Nevertheless, the derivation is not instructive, and
moreover, the discussion is almost identical to that found in Bailey
(1971). Given this, the author should have referred the interested
reader to the above text and devoted the space to more important 
topics. 

The chapter on analysis of variance, though mathematically cor-
rect for the most part, is rather poor in terms of the presentation.
The rationale for not testing the means pairwise is not given. The 
author attempts to derive the test statistic intuitively. The deriva- 
tion is misleading and erroneous, a fact that would have become 
obvious had the author attempted to reconcile this statistic with 
that developed more formally later. 

The author’s writing style leaves a lot to be desired. His utter 
disregard for mathematical grammar and his persistent use of 
"therefore" instead of "then" in conditional sentences, are annoying. 
The author's illustrations of statistical concepts with examples 
drawn from his own experience, although amusing initially, become 
tedious quickly. Yet another fault is his choice of examples to 
illustrate the procedures. These primarily revolve around Neandert- 
hal skulls and sports cars. Beginning students often profit from 
examples drawn from their own disciplines. 

Despite these faults, the author is quite successful in introducing 
statistics via the underlying principles and concepts. There are at
least two other books, Hays (1973) and Bailey (1971), that intro-
duce statistics in a similar fashion. The mathematical level of
Bailey (1971) is higher than that of either Hays (1973) or Frank.
These latter texts are fairly comparable with regard to the mathe-
matical skills required. However, Hays (1973) is more extensive in
terms of both the topics covered and the discussions supporting the
mathematics within each topic. Bailey (1971) and Hays (1973) 
are intended for an advanced undergraduate or a beginning graduate 
statistics course. Frank’s book is intended for an introductory under- 
graduate statistics course, but it is unclear whether the text is in-
tended for beginning or advanced undergraduates. Despite the faults
mentioned earlier and the availability of competing texts, we feel
the book would be of value for an introductory statistics course for
advanced undergraduates.


Leonard M. Horowitz. Elements of Statistics for Psychology and
Education. New York: McGraw-Hill, 1974. Pp. xv + 464.
$10.95. 


This book is designed as a basic textbook for an introductory,
applied statistics course. In the Preface, the author alludes to the
fact that there are a multitude of introductory/intermediate statis-
tics textbooks on the market today, most of which are decent, re-
spectable books with each author professing to offer something
special. As is the case with these other authors, Mr. Horowitz feels 
that, based upon his teaching experience, he has developed certain 
pedagogical notions which he decided to translate into book form. 
These pedagogical notions take the form of avoiding the cookbook 
approach by first describing the statistical procedure and then 
explaining, through illustrations and examples, the theoretical under- 
pinnings. 

At first glance, the reader is struck by the similarity between this 
book and the above-mentioned multitude of books in terms of topics 
covered. While the author does not suggest a time frame for utilizing 
the book, it is assumed that the topics can be covered in a one-
semester or two-quarter course and therefore, a second glance is
necessary to determine first of all if there is any new sequencing
of topics and secondly, whether the presentation of the material
merits adoption of this book in light of the availability of other
options.

Chapter 1 is an attempt to whet the reader's appetite for the 
subsequent chapters. It is this reviewer's opinion that this attempt
to synthesize “Philosophy of Science,” “Basic Research Methods,”
and “Lying with Statistics” falls rather short of the intended
goal. It may be viewed as a chapter that a professor might assign 
for students to “read casually.” 

As the author points out, there are two schools of thought regard-
ing the inclusion of “old-fashioned” topics; i.e., frequency distribu-
tions and elementary descriptive statistics, in a textbook of this
type. He adheres to his philosophy of not belaboring these simple
topics in Chapter 2 and does a commendable job of illustrating the
formula for percentiles. However, in Chapter 3, he belabors exten-
sively. The concepts of Central Tendency and Variation, with the
accompanying rules of summation, are again simple topics which
are familiar to most students who would be using this text. It is 
this reviewer's opinion that while it is important for students to 
have “a feeling for” a frequency distribution in preparation for 
theoretical distributions, it is not necessary, and often more con- 
fusing than it is worth, to have students compute means, medians, 
and variances from frequency distributions. With modern com-
puters and packaged programs, these laborious and out-dated com-
putational formulas should be forgotten.

Chapter 4 finishes the presentation of the basic statistical con-
cepts with the normal curve and z scores. While the discussion of
this first portion of the chapter is quite well done, two errors in
content structure become apparent. First, the author suggests that
the section on normalizing distributions can be omitted for a shorter 
course. However, this is the section in the book that deals explicitly 
with the scales of measurement, considered by many as an ex- 
tremely important concept in applied statistics. Secondly, correla- 
tion is treated as a "related topic." Again, correlation is more than 
a related topic and deserves more than the eight pages which
Horowitz has devoted to it. Additionally, there is no mention of
other correlation coefficients, and the naive reader may be led to
believe that the Pearson Product-Moment as defined in this book
is "correlation." Chapter 12 refers only briefly again to correlation; 
however, it does have an excellent discussion of the relationship 
between linear correlation and linear regression. This section is 
especially meaningful in light of the growth of the Multiple Re- 
gression Approach in behavioral research. 

While slighting measurement scales and correlation, the author 
deals exhaustively with probability and binomial distribution in 
the traditional set theory approach. At times, the reader gets bogged 
down with all the symbolism, and this may impede effective under- 
standing of the concept of probability and how it relates to hypoth-
esis testing and interval estimation. The soundness of the pedagogy
of this exhaustive presentation may be questioned for this type of
book.

The author is at his pedagogical best, however, as he begins the
discussions of Statistical Theory and Hypothesis Testing. At this 
point, he begins to tie together all of the preceding concepts. The 
approach is quite logical with the author introducing new and some- 
times quite difficult concepts for the most part in a very direct and 
readable manner. A weakness in this approach is the return to 
philosophy of science with the Latin terminology (p. 164) to ex- 
plain the purpose of hypothesis testing and the associated errors 
inherent in this procedure. The student may have difficulty in learn- 
ing and understanding the full meaning of rejecting or failing to 
reject the null hypothesis. 

The discussion of the sampling distributions of the mean is very
illustrative, and the direct tie to hypothesis testing and interval
estimation is logically structured. This is a potential strength of
the book in this reviewer’s opinion due to the fact that many of
the available books deal with these two topics separately and often
do not tie them together. However, there is a subtle yet important
flaw in the discussion of the confidence interval for the mean (pp.
189-90). The author discusses the confidence interval in terms of
the sample mean X̄ falling between the two extreme values; i.e., the
confidence limits. He states that the 95% confidence interval implies
that “the probability is .95 that the sample mean lies within 1.96
standard errors of the population mean” (p. 189), rather than that
the probability is .95 that the interval X̄ ± 1.96s_X̄ spans the popula-
tion mean. This is a critical error in reasoning and seriously affects
the understanding of the relationship between hypothesis testing
and interval estimation.
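
The interval-based reading preferred here can be checked by simulation. The sketch below is an illustration under stated assumptions, not material from the book: it assumes a normal population with a known standard deviation (so the standard error is sigma divided by the square root of n), and the values of mu, sigma, n, and the function name are hypothetical. The population mean is held fixed while the interval varies from sample to sample, and the code counts how often the interval spans the mean.

```python
import math
import random
import statistics


def coverage(mu=50.0, sigma=10.0, n=25, trials=10_000, seed=1):
    """Fraction of simulated samples whose interval
    xbar +/- 1.96 * (sigma / sqrt(n)) contains the fixed mean mu."""
    rng = random.Random(seed)
    se = sigma / math.sqrt(n)  # standard error of the mean, sigma known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        if xbar - 1.96 * se <= mu <= xbar + 1.96 * se:
            hits += 1
    return hits / trials


print(f"proportion of intervals spanning mu: {coverage():.3f}")
```

Under these assumptions the proportion should settle near .95 over many repetitions. The statement is about the long-run behavior of the interval-building procedure; any single computed interval either contains the population mean or does not.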

In the Analysis of Variance chapter, the author basically returns 
to the traditional cookbook approach to teaching ANOVA. While 
he does attempt to explain the meaning of the mean squares, sum 
of squares and degrees of freedom, the reader with a limited analyti- 
cal mind and without a firm grasp on algebraic manipulation could 
easily get frustrated with all of the symbolism. Additionally, only 
one-way ANOVA is presented with an optional section of a repeated
measures design. In view of the rather exhaustive discussion of other
topics earlier and later in the book, a more extensive discussion of
higher order ANOVA and the various models (fixed, random, mixed)
would seem to have been consistent with other sections of the book
while also better explaining the method of ANOVA. This presenta- 
tion was considered good at best. 

When the author gets to the chapters on non-parametric statistics,
the introductory discussion is weakened due to the lack of a good 
discussion regarding the scales of measurement. While there is a 
continuing debate concerning the power of non-parametric versus 
parametric statistics for data measured on less than an interval 
scale, the scale of measurement is still regarded as one assumption 
underlying the use of parametric statistics, which the author fails 
to mention. The description of the various χ² distributions is well
written but possibly more than is needed in a book of this type.
The reader not totally familiar with ogive percentages and the
concept of Expected Values may have trouble plowing through all
of the explanations. 

In summary, the author attempted to show the need for yet 
another elementary /intermediate statistics book by presenting the 
material in a more pedagogical way. While he generally succeeds,
there are the obvious strengths as well as the glaring weaknesses.
The strengths are primarily in the writer’s logical presentation of
the material. While the relevance of the exhaustive discussion of
probability and the binomial and χ² distributions may be questioned,
the material is well presented. Also well presented were the sections
on the normal distribution, the t-distributions, the β error and the
power of a statistical test. Major weaknesses include out-dated
discussion of computations from frequency distributions, limited
discussion of the concept and methodologies of correlation, limited
discussion of the scales of measurement, the critical flaw in the dis-
cussion of the confidence interval, and only a good presentation of
ANOVA.


DENNIS E. HINKLE
Virginia Polytechnic Institute and State 
University 


Arthur R. Jensen. Genetics and Education. New York: Harper & 
Row, 1972. Pp. vi + 378. $10.00. 


In this collection of previously published articles, Jensen has 
brought together and updated those most pertinent to the contro- 
versy which has become synonymous with his name. The corner- 
stone in the collection is the 1969 Harvard Educational Review 
(HER) paper which ignited the debate on the origin and nature 
of racial and social class differences in IQ, “How much can we boost 
IQ and scholastic achievement?”

As might be expected from such an emotionally charged issue, the
debate has generated more heat than light, but as the impartial
reader of these papers can see, the fault for this lies far more with 
his intemperate critics than with Jensen. This observation is 
strengthened by a most revealing preface, written especially for 
this book, in which Jensen chronicles professional and other reac- 
tions to the HER paper. It should give pause to those who feel that 
free speech and academic freedom are totally secure and impreg-
nable.

The more serious criticism to which Jensen's claims have been
justifiably subjected has concentrated on two major areas: the 
accuracy and completeness of the genetic arguments employed; and 
the nature and meaning of the measurement process, i.e., the IQ. 

Two key points in the genetic argument are most frequently
cited as weak links in the chain of Jensen's argument. The first has
to do with the relationship between within group and between
group heritability. In this edition the use of footnotes which cite
DeFries' formula for the exact theoretical relationship helps to
clarify the issue somewhat. It is difficult for non-geneticists to under- 
stand the arguments completely or to know the appropriate anal- 
ogies to other kinds of data which might permit an estimate of the 
likely relationship, but two inferences can be drawn. First, a high 
within group heritability increases the a priori probability of a high 
between group heritability. Second, the actual figure is an empirical
question which only the collection of relevant evidence can decide. 
Reasonable critics have suggested that such evidence could well fail 
to support Jensen’s “not unreasonable” conclusion that genetic 
factors are implicated in racial and social class differences in IQ
scores. But it should be noted that Jensen points out both of these
inferences in his book, and welcomes the research which could shed
light on this dark corner. 

The second area of criticism regarding the genetic argument is 
related to questions about the measuring instruments, and is framed 
by the question “When is a difference a deficit?" Some of Jensen's 
critics have argued that simply citing the intellectual component of 
middle class life as the ideal does not make it so, and the sub-
sequent discovery of a mean score on IQ tests for blacks which is
15 points lower than that for whites does not thereby constitute a
deficit. 

This argument has some merit. That constellation of skills which 
this culture has valued highly and deemed "intelligence" is not well 
understood, nor are we safe in assuming that the same or essentially 
similar skills will continue to serve us adequately in the future. 
There is considerable value in diversity, especially genetic diversity. 
But whether or not these cognitive skills are in any sense “better” 
than others, or more important in the long run than certain “non- 
cognitive” skills which may be differently distributed, the fact 
remains that they appear to be the skills essential to success in 
scholastic endeavors and at the more prestigious and higher paying 
occupational levels in this society at this time. Thus the difference 
is clearly a deficit in terms of immediate social concerns. 

Jensen’s major thesis in the HER paper and in this volume is 
that this “difference-deficit” is most likely genetic in origin and
that educational methods which rely principally on developing
and using those cognitive skills in which the deficit is greatest
serve only to accentuate the differences and render themselves worthless.
The second half of the argument, the educational implications, has 
been less noted and commented upon than the genetic implications, 
but it does deserve more thorough consideration. 

Lest the incautious reader jump to the conclusion that blacks
should be taught in one way and whites another, Jensen notes 
explicitly several times that the basis for educational decisions must 
be adequate and accurate psychometric assessment of all the in- 
dividual’s skills; e.g., “I have always advocated dealing with persons 
as individuals, and I am opposed to according differential treatment
to persons on the basis of their race, color, national origin, or social-
class background.” (p. 329). Those who view all psychometric tests
as nothing more than repressive tools of the establishment no doubt
see such a statement as a self-serving escape clause, but the burden
seems to be heavily upon such critics to demonstrate that skills so
assessed are not educationally and socially relevant. 

Jensen’s principal solution is to make use of what he terms Level I
abilities (“associational” cognitive skills) when Level II abilities
(“analytical” cognitive skills or “intelligence”) are weak or deficient.
Level II abilities are required for success under traditional educa-
tional regimens, but are not necessary to learn the traditional edu-
cational content, at least through the eighth or ninth grades. Jensen
further argues that such a level of achievement, considering the 
plight of many lower (Level II) ability children, is worth striv- 
ing for. 

It is obvious that many problems remain before such analyses
or recommendations are acceptable, but it is to Jensen's credit that 
he has brought these sensitive issues into the open. Further research, 
especially of the educational aspects, would be most welcome. This 
book, as a collection of several highly relevant papers, is worth the 
price for those who desire more than superficial knowledge of this 
controversial topic. 


DANIEL P. KEATING
University of Minnesota 


Melvin R. Novick and Paul H. Jackson. Statistical Methods for
Educational and Psychological Research. New York: McGraw-
Hill, 1974. Pp. XVIII + 456. $16.50.


This is the first textbook which gives a fairly comprehensive
exposition of Bayesian statistical methods directly applied to educa- 
tional and psychological research. As such, it is an important book 
which I highly recommend. 

Its importance will be felt primarily by two groups. Those who 
want to know what Bayesian statistics is all about and who desire 
concrete examples will be well served. Secondly, those of us who 
teach Bayesian methods, either as an integral part of a statistics 
course or as a special course, will find the book more than welcome.
Previously, we had to rely on various references and books which
had little direct application to the behavioral sciences. As a con-
sequence, we were strong on theory and weak on real-life applica-
tions. In addition the lack of a consistent presentation relative to
symbols, etc. was tough for both the teacher and student. These
problems are now alleviated (if not solved). A warning to both
groups is in order, however. Don’t expect that you will be able to
browse through the book and then appreciate the Bayesian method. 
Partly, this is a function of the complexity of the subject, but also 
because the book is not organized in such a way as to facilitate 
skimming. Various topics are mentioned at several places through-
out the book, and one really needs to read the intervening text to
fully grasp the message and/or implications. Chapter 6 entitled,
“The Logical Basis of Bayesian Inference,” comes closest to a
“stand-alone” section which readers may skim for a quick intro-
duction to Bayesian philosophy, but I would bet that a re-reading
of this chapter, after studying the balance of the text, would give
numerous additional insights.

For those who will use this as a text in their classes, be warned
that the reading level is quite high, mainly due to the high in- 
formation density per paragraph. Many examples are given; and 
conscious attempts have been made to aid the student, but even 
the examples must be studied very carefully in order to appreciate 
their import. The mathematical sophistication required is also high, 
but this “goes with the Bayesian territory.” All in all, the teacher 
will have to keep his assumptions about what the students are 
comprehending during a reading assignment to a minimum and 
will have to be prepared to explain and expand upon the text. I 
have, in fact, spent total class periods working through examples 
given in the text. (This has advantages for those not happy with 
just lecturing.) 

Some features of the book are particularly noteworthy. The 
emphasis on concrete applications pervades the entire text. The
theme is that the methods are being presented to solve real prob- 
lems, and this goes far beyond merely defining x as a test score 
rather than as pounds of fertilizer. Several detailed case studies, 
referring to published literature, make this book fairly unique in 
the educational statistics literature. 

Several places throughout the text include sections marked off 
in large boxes which summarize various features of standard (and 
non-standard) distributions or various formulae needed to work 
particular prior-posterior problems. These are very helpful and 
will be much appreciated by students. 

The set of statistical tables is excellent and is the result of a
considerable amount of effort by Professor Novick and his asso-
ciates. As mentioned in the text, they are a subset of a larger col-
lection of tables. Especially noteworthy are high density intervals
of the beta, inverse chi, and chi-square distributions, percentage
points of the Behrens distribution, and probabilities that one beta
variable is larger than another. No tables of the binomial distribu-
tion are given even though the binomial is a sampling distribution
featured in the text. Space is a limitation, but I would have pre-
ferred a few values of the binomial over the natural logarithm.

In this connection the text includes examples of finding binomial 
probabilities using the National Bureau of Standards tables, and 
in two of the problems at the end of the chapter, the student is 
asked to find probabilities using “the binomial table.” 

On page 116 is an example of finding a cumulative beta prob-
ability. Again, the reader is referred to tables outside of the book,
i.e., Pearson's tables of the Incomplete Beta Function. The text
states, correctly, that the entries are tabled for p < q. The example
continues using interpolation in Pearson’s tables and the final result
is noted. It is not mentioned at all that the beta tables in the back
of the book can be used for the same problem. However, these tables
are for p ≥ q, which makes one wish the other tables had not
been mentioned. The table entry in the book is to four decimal
places without interpolation! Curious.

The writing is precise, sometimes witty and always correct. One
might disagree with emphasis or interpretation, but there are no 
substantive errors. (At least I couldn't find any.) The book is 
also amazingly free of typographical errors for which the authors 
and publisher should be commended. (I did find one on page 244.) 

The book is organized in three major sections: “Problems, Data,
and Probability models,” “Elementary Bayesian methods,” and
“Bayesian methods for comparing Parameters.” The first section
is an introduction, with data, that really says something, which is
different from most statistics books. The second section includes
analysis of binary data and the one-sample normal model. Some 
general theory and Bayesian concepts are included here. The third 
section includes the two-sample normal model, regression and cor- 
relation, and comparisons of binomial parameters. 

Exercises are given at the ends of chapters. They are somewhat 
skimpy, especially for chapters 5 and 6, but, in general, are ad- 
equate. No answers are given. 

A review is not the place to argue for any philosophy and the 
authors of this text do not spend much space arguing, either. They 
are masters of the understatement in this regard. The methods are 
there. Use them if you like. Some "traditionally" trained readers 
will be uncomfortable reading this book. This will no doubt occur 
when they first see the deliberate insertion of subjective beliefs into 
the analyses of the case studies. Their uneasiness will match the 
feelings I had two days before writing this review. A researcher 
asked me for a second class of transformations he might use so
that he could obtain significance—his first tries were unsuccessful.
He proudly informed me later he had found a nonparametric test
that did the job—i.e., p < .05! I respectfully ask all readers to
interpret his analysis. 

There is a package of computer programs which are mentioned 
and illustrated in the text. They are not necessary, but are very 
helpful. I assume one could write to Professor Novick for a listing. 

My only fundamental unhappiness with the book is that S² and
x. are used for the sums of squares and the mean. Every time I
make a mistake on the blackboard because of this, I will silently 
(or perhaps, loudly) protest. The moral is that we must all stay 
flexible even if it hurts—which advice I especially give to non- 
Bayesians who will read this book. 

DONALD L. MEYER
University of Pittsburgh 
¹ The Iowa Testing Program announces the publication of Tables for
Bayesian Statisticians ($15.00 prepaid). A review will appear in the Summer
issue of this journal.

David A. Payne (Ed.). Curriculum Evaluation: Commentaries on
Purpose, Process, Product. Lexington, Mass.: D. C. Heath,
1974. Pp. v & 357. $7.95 (Paperback).

At first glance, this volume seems to be merely another of the 
books of readings that are in vogue these days. Closer attention 
shows that it is better done than many others. One of the articles 
was written for this book; one is a paper that seems not to have 
been published previously; one is an abridged composite of two 
that appeared in different journals; many others were abridged to 
suit the editor's purpose. The 44 articles are organized into five 
groups dealing with purposes and problems in evaluating curricula, 
identifying goals and objectives, design and analysis of evaluation 
studies, measurement techniques and problems, and illustrative
evaluation projects. 

In a fourteen-page “prologue,” the editor has drawn on selected 
writings and his own experience to describe recent changes in the 
nature of evaluation, characterize evaluation, distinguish between 
evaluation and research, and characterize curriculum evaluation 
in particular. He has written an introduction of some two pages 
for each of the five parts and a paragraph introducing each of the 
articles. These introductions will help the reader to decide which 
articles may be useful. 

One who is looking for objective means to measure the outcome 
of curriculum change will not find them in this volume. But 
neither will he find them elsewhere; for an area so value-laden 
as curriculum such means have not been devised. One looking for
a model for curriculum evaluation can get some help from the
articles by Barrow, Stake, and Metfessel and Michael. Perhaps the
most useful parts of the volume are negative; many of the authors
warn against specific pitfalls that are likely to trap curriculum
evaluation. Reading the descriptions of the eleven curriculum
projects provided in Part V will be a broadening experience to
those who are not familiar with the quantity and variety of cur-
riculum projects that have been developed in recent years. The
essays by Tyler and Cronbach are particularly challenging in
concept.

The book is a compilation of writings by persons whose experi-
ences and ideas should prove valuable to curriculum makers and
evaluators.

WILLIAM H. CARTWRIGHT 
Duke University 


Joseph R. Royce (Ed.). Multivariate Analysis and Psychological 
Theory. London and New York: Academic Press, 1973. Pp. 
xvi + 567. £8.20 ($23.50 in United States)

Representing a collection of the 14 papers and of the comments 
and rejoinders to the comments of these papers presented at the
Third Banff Conference on Theoretical Psychology, which was held
from September 27 to October 2, 1971 under the cosponsorship of 
the Center for Advanced Study in Theoretical Psychology, the 
University of Alberta, Edmonton, Canada, and the Society of 
Multivariate Experimental Psychology (SMEP) as well as includ- 
ing an introductory overview paper of 13 pages by the editor, this 
volume reflects the basic view that theory construction in psy- 
chology can be greatly facilitated through the application of multi- 
variate research strategies, particularly factor analysis. The con- 
tributions of the distinguished participants have been organized into 
two major divisions: (a) “Part I, Methodological, Pre-Theoretical 
and Meta-Theoretical Issues” and (b) “Part II, Toward a Com- 
prehensive, Multivariate Psychological Theory.” To illustrate the 
issues of Part I, an enumeration of the titles of the eight papers 
and their authors would seem helpful: “Right Answers to the 
Wrong Questions? A Re-examination of Factor Analytic Personality 
Research and its Contribution to Personality Theory,” K. Pawlik; 
“Linear Regression Equations as Behavior Models,” K. V. Wilson;
“How Shall We Conceptualize the Personality We Seek to In-
vestigate?” D. W. Fiske; “Prescriptions for a Multivariate Model
in Personality and Psychological Theory: Ecological Considera-
tions,” S. B. Sells; “Multivariate Approaches to the Study of
Cognitive Styles,” P. E. Vernon; “Comparative Studies of Multiple
Factor Ability Measures,” S. G. Vandenberg; “Theory of Functions
Represented among Auditory and Visual Test Performances,” J. L.
Horn; and “Theoretical Issues and Operational-Informational Psy-
chology,” J. P. Guilford. Similarly for Part II, the six titles and
corresponding authors were as follows: “Multivariate Models of
Cognition and Personality: The Need for Both Process and Struc-
ture in Psychological Theory and Measurement,” S. Messick; “The
Conceptual Framework for a Multi-Factor Theory of Individ-
uality,” J. R. Royce; “Causal Theories of Personality and How
to Test Them,” J. A. Gray; “Key Issues in Motivation Theory
(with Special Reference to Structured Learning and the Dynamic
Calculus),” R. B. Cattell; “A Multidimensional Theory of De-
pression,” T. Weckowicz; and “The Psychological Structure of Peer
Group Forces in Delinquency,” D. S. Cartwright and Katherine 
Howard. 

In setting the stage for what is to follow Royce describes the
current state of the art in multivariate theoretical psychology,
differentiates between the terms “multivariate” and “theory,” re-
views several of the technical problems and issues underlying factor
analysis relative to its use in the development of a comprehensive
multivariate psychological theory, discusses briefly issues in sci-
entific methodology as applied to psychology, and endeavors to
explain the role of theory in psychology in relation to each of four
stages through which advanced scientific disciplines have gone.
He presents a succinct overview of the conference papers from
which the reader can readily grasp both the basic structure of the
volume and the principal methodological and substantive issues
considered. From this excellent integrative summary the reader can
probably decide which papers he may wish to study in detail.

In Part I the first two papers by K. Pawlik and K. V. Wilson
appear to be concerned very much with the underlying tenets and
applicability of the factor analytic model and related linear re-
gression models to theory construction. The authors as well as the
discussants give considerable attention to nonlinear and interactive
as well as linear components manifest in behavioral data and to
the need for appropriate modifications in the familiar compensatory
linear model to permit an improved conceptualization and under-
standing of behavior patterns. Somewhat more substantively ori-
ented than are the first two papers, the third contribution by Fiske
deals with how to conceptualize and measure personality first
by identifying and examining several critical problems and
then proposing steps to resolve these problems, although several
problems are left as issues to be considered. Emphasis is placed
upon the need to define constructs in both their behavioral and
contextual (situational) facets as well as upon extensive psycho-
metric efforts in a variety of settings. In the fourth paper Sells
endeavors to develop a comprehensive multivariate model of per-
sonality that embraces personal and environmental components
with a view of identifying sources of variance attributable to
their subcomponents. As its main thrust the personality model
would be toward constructing a functional system of components
that would eventually provide a means for viewing the patterns
and operational capacities of these components as a basis for a
typology of personality. Subsequent to the fifth paper by Vernon,
which is a scholarly but relatively short historic review of cognitive
styles, the sixth and relatively long paper by Vandenberg is an
effort to present critically relevant data from numerous
studies concerning multifactor ability measures for evaluating their
usefulness relative to seven criteria. In the seventh paper, which
is more specific in its concern with the development of a theory
of the psychological function of auditory tests, Horn has amassed
numerous data from his own studies and those of several other
investigators and has endeavored to interrelate auditory and visual
functions. As the eighth and concluding paper in Part I, the one by
Guilford on theoretical issues associated with an “operational-
informational” point of view is an examination of the relationships
between his approach and each of five historical-theoretical issues:
mental faculties vs. mental factors, mental acts vs. mental contents,
difficulties in elementarism (e.g., simple behavior units vs. Gestalt
configurations), the shortcomings of associationism (including be-
haviorism), and subjective vs. objective orientations to the study
of human behavior. In essence Guilford has tried to develop a
general theory of behavior which clearly has its roots in the Struc-
ture-of-Intellect model.

Although the papers in Part I supposedly emphasize “meta-
theoretical, methodological, and pretheoretical [primarily empirical
or minimally substantive theory oriented]” concerns, the six papers
in Part II have been categorized as mainly substantive-theoretical
in that they tend to go beyond the data and to afford a speculative
explanatory basis for understanding the complexities of human
behaviors. As the first of six papers in Part II, the one by Messick
is an attempt to take into account both structural and functional 
(dynamic) components of behavior. After reviewing Guilford’s 
structure-of-intellect operational-information model and hierarchi- 
cal models of intellect, Messick proposes the need for a model that 
considers the sequence of operations in complex cognitive processes 
as in learning and concept attainment, perception and attention, 
memory and recall, and problem solving and creativity, and then 
goes on to consider stylistic aspects of cognition as reflecting per-
sonality dimensions “that cut across affective, personal-social, and
cognitive domains and thereby serve to interweave the cognitive 
system with other subsystems of personality organization [p. 287]." 
In his comprehensive formulation Messick also considers develop- 
mental changes in cognition as they interact with environmental 
variables. Similarly in the second paper Royce expounds a general 
theory of individual differences relative to which he sets forth a
hierarchical model for affective, cognitive, and style structures.
He endeavors to incorporate process or change characteristics
within his theory through examining the ontogeny of factors,
pointing out hereditary and environmental sources of variation,
posing a factor-gene model, considering cultural-learning mech-
anisms in relation to factors, and finally speculating upon the
neural mechanisms underlying cognitive and affective processes in
individuality.

Continuing with a physiological-neurological emphasis, Gray,
in the third paper, presents a causal theory of personality in which
he attempts to specify three different brain systems as essentially
isomorphic with corresponding manifestations of temperament-
emotionality as conceptualized with some modifications in Eysenck’s
familiar dimensions (introversion-extroversion, neuroticism, and
psychoticism). Evidence relevant to the theory is presented from
experiments of children's behaviors in an operant conditioning task.
Also interested in emotionality and the affective side of human be-
haviors, Cattell in the fourth paper of Part II endeavors to define
motivation measurement by offering a familiar factor specification
equation embodying ability, temperament, and dynamic com-
ponents, of which the third mentioned one includes a modulating
index that reveals by how much an ambient (existing general back- 
ground) stimulus provokes an increase in a dynamic trait. In his 
discussion of his theory Cattell directs his attention to a number 
of concerns about which he expounds at length: (1) the riddle of 
the nature of the components of motivation, (2) the measurement 
of conflict in the clinical field and the prediction of decision, (3) 
the identification of extra-motivational determiners of conflict in- 
volving possible relationships between measurable personality 
factors and generalized dynamic factors, (4) the introduction of 
three classes of measurable predictors within the context of learn- 
ing theory (general psychologic states, ergic tension levels, and 
reward as tension reduction) with particular emphasis on emotional 
learning, and (5) the need for experimentation to test the work- 
ability of the principles or parameters set forth in the structured 
learning theory. 
The two concluding papers in Part II tend to be rather specific
in emphasis in that the fifth one by Weckowicz is concerned with 
how multivariate concepts and strategies could be useful in de- 
lineating several of the theoretical and practical issues in psycho- 
pathology as in the development of a multidimensional theory of 
depression and in that the sixth and concluding contribution by 
Cartwright and Howard is directed toward the development of a 
multivariate model to describe peer-group forces in delinquency. 
These last two papers serve to illustrate both the power and versa- 
tility of multivariate methodology as a tool in aiding the psychol- 
ogist to work in two distinct areas of great social importance. 
The evaluation of a collection of papers arising from a conference
is a difficult task especially when the participants have been given 
substantial freedom in the selection and preparation of their papers, 
for the three broad criteria set forth at the time of the call for 
papers were that each contribution would be multivariate, sub-
stantive, and theoretical in scope. Despite the differing emphases
in the papers, the editor succeeded in organizing and sequencing
them in such a way as to achieve a relatively high degree of
coherence if not unity in the final product. An additional unifying
agent has been the insertion of an introductory section as well as
the inclusion of comments from one or more of the conference
participants and of rejoinder statements from each author. Hence
this substantial interaction among the membership of the conference
has had an integrating as well as a clarifying influence that has
tended to reconcile similarities and differences in points of view
expounded.


There is little doubt that this comprehensive volume represents 
much of the latest thinking of several of the most distinguished
contemporary theorists in psychology. It affords numerous sug-
gestions—either direct or implied—for needed research efforts at
a methodological, theoretical, and empirical level. In addition to
its broad coverage, it makes evident the rich potential of multi-
variate approaches to theory building and to the advancement of
psychology as a science. Important fringe benefits include the ex-
tensive bibliographies in several of the chapters, an author index,
a detailed subject matter index, and a workable table of contents.

In summary this volume is a significant contribution to the 
theoretical literature in psychology. It is also of great importance 
to the psychometrician who can gain improved insight regarding 
the contributions which multivariate analysis can make to the 
conceptualization and understanding of the complexities of human 
behavior. Both the psychological theorist and methodologist should 
have this highly sophisticated and thought-provoking book as an 
integral part of their professional libraries. 


WILLIAM B. MICHAEL
University of Southern California 


Julian C. Stanley, Daniel P. Keating, and Lynn H. Fox (Eds.).
Mathematical Talent: Discovery, Description, and Develop-
ment. Baltimore, Md.: Johns Hopkins University Press, 1974.
Pp. xvii + 215. $10.00 cloth, $2.95 paper.


The nine research and discussion papers contained in this volume 
are the consequence of a five-year project, the Study of Mathe- 
matically and Scientifically Precocious Youth (SMSPY), sponsored 
at The Johns Hopkins University by the Spencer Foundation. 
Julian Stanley, the director of the project, sets the stage in Chapter 
1, "Intellectual Precocity,” with a brief review of earlier work on 
talented youth. The book is dedicated to Lewis Terman, and his 
classic study of the gifted, in spite of its recognized flaws, is high- 
lighted. Several case histories of mathematically gifted young people 
are also discussed. 

Stanley writes so well that the reviewer found himself desiring 
a more comprehensive discussion instead of the short shrift to which
he was treated in Chapter 1. Although Stanley is justifiably critical
of society's failure to support research and education of the gifted
while providing funds aplenty for the educationally disadvantaged,
it should be noted that the disadvantaged are greater in number
and a more serious economic problem than the gifted. Whether
the gifted would sufficiently fulfill their potentialities without
special treatment is a question unanswered.

Chapters 2 through 4 describe the major findings of the project 
after its first year of operation. The data obtained on 35 mathe-
matically precocious boys and eight mathematically precocious girls
identified in a mathematics and science competition participated
in by 396 seventh and eighth graders are described by Daniel
Keating in Chapter 2. A number of interesting findings emerged
from the tests and questionnaires administered: pronounced sex
differences in favor of the boys, disillusionment with school on the
part of many gifted students, and no indication of personality diffi-
culties in the gifted boys. Other findings, e.g., the higher educational
level of the parents and greater investigative interests in their
gifted children, are less surprising. Furthermore, the range of inter-
and intraindividual variability on the test and questionnaire data
is considerable. By way of criticism, it is unfortunate that no
interviews or observational data of the sort provided by Krutetskii 
(1966) on younger children were collected. 

Continuing with the four core chapters, in Chapter 3 Lynn Fox 
discusses alternative ways of facilitating the development of mathe- 
matically precocious youth—special schools, enrichment, accelera- 
tion, and early admission to college or specific college courses. Based 
on the SMSPY data, Fox makes a strong case for individualizing 
instruction of the gifted, particularly by letting them take college 
courses early. In Chapter 4 Helen Astin discusses the reportedly 
unanticipated finding of significant sex differences in favor of boys 
obtained in the project. Not only did the boys perform at a higher 
level than the girls, but the former reported less liking for school, 
showed an interest and precocity in mathematics at an earlier age, 
tended to be the oldest children from large families, but showed less 
tenderness, sympathy, and sociability than girls. Both gifted boys 
and girls tended to come from achievement-oriented, middle-class 
families and to be rated as likeable children by their parents but to 
show less tenderness, sympathy, and sociability than girls.

Anne Anastasi's commentary on the precocity project, given in
Chapter 5, brings out some of its strengths and weaknesses and
makes a number of valuable suggestions for further research. In
addition to being careful with the use of terms such as “latent
talent,” studies of the childrearing practices of parents of the gifted,
and the role of social encouragement and special materials in foster-
ing sex differences in abilities are some of the recommendations
made by Anastasi.

Chapters 6 through 9 were written especially for this volume, 
and form a compendium of “afterthoughts” concerning the project. 
In Chapter 6, Fox describes the results of a special accelerated
mathematics program participated in by 19 gifted students who
had just completed sixth grade. By means of guided independent
study, in a very short time these students learned algebra. Analysis
of test scores and other data obtained from these students revealed
the importance of the parents’ interests and motivation, and other
(verbal) aptitudes as well as mathematical ability in promoting
success in this subject. Social factors are especially important for 
girls, and interest is more important in determining perseverance 
than in actual achievement. 

Along with ability tests and other measures, the California Psy- 
chological Inventory (CPI) was administered to the 35 mathe- 
matically gifted boys in the main study. The CPI profiles of this 
group were compared with those of an eighth grade random, an 
eighth grade gifted, a high school gifted, and a high school norm 
group. Relying on these data, the authors of Chapter 7 criticize 
the “traditional assumptions” that the gifted are interpersonally 
ineffective or maladjusted. Although the authors employed some 
questionable statistical procedures and tend to jump to conclu-
sions somewhat, they make their points that the gifted are not
generally maladjusted and that acceleration would not have delete- 
rious effects on their personality development. 

The career interests and values of the gifted are discussed in 
Chapter 8, based on administration of the Vocational Preference 
Inventory (VPI) and the Study of Values. The findings that junior 
high students who enter a mathematics and science competition 
have interests in mathematical and scientific occupations are not 
surprising, but some of the specifics, such as the fact that boys
more than girls prefer investigative occupations, are worth noting. 
Sex differences were also found on the Study of Values scales. 
The theoretical value was highest for most of the 35 male com- 
petition winners, and the girls’ social values were higher than those 
of the boys. The point is again made that academic ability alone, 
in the absence of appropriate interests and evaluative attitudes, 
does not always lead to precocious achievement. It is unfortunate 
that the brief measures of interests and values used here did not 
tap these variables better. The Strong Vocational Interest Blank, 
a depth interview, a good biodata inventory, and careful observa- 
tions of gifted students would have provided more information on 
their interests and personalities. 

The last chapters of the volume give the results of an assess- 
ment of the study habits and attitudes, observations of classroom 
behavior, and teacher and student impressions of five junior high
school boys who took a college course in mathematics. The fact
that these students did well in the class and were assimilated suc- 
cessfully lends credence to the notion that acceleration of gifted 
children is the best policy, especially when there are several such 
children together. 

In an epilogue summary of Chapters 6-9, the editors of the
volume recapitulate the purposes of the project and some of the
questions that were examined. Although many of these questions
concerning the development of mathematical ability, its effects on
social and emotional development, and the origins of individual and
sex differences require more extensive study, the results reported 
here permit at least tentative conclusions on several matters. Pre- 
cocious students' social and emotional development does not appear 
to be unduly harmed by separating them from their age group, and 
they benefit intellectually by being accelerated. 

Although the Study of Mathematically and Scientifically Pre- 
cocious Youth is not of the same magnitude as Lewis Terman’s 
landmark longitudinal investigation, it represents a much-needed
reopening of issues concerning the development and training of 
individuals possessing special abilities. In addition to mathematical 
precocity, other abilities demand continued study. As noted in this 
volume, SMSPY is being complemented by a parallel investigation 
of verbally precocious children. 

There are so many unanswered questions on the psychology and 
education of human abilities that it is difficult to say what we 
know for certain and what problems would be worth investigating. 
But in spite of small sample sizes, inadequate instruments, and 
questionable methodology in certain instances, the Stanley, Keating, 
and Fox study provides an impetus and some direction. Perhaps 
its most meritorious contribution, like that of Lewis Terman, is 
to be found in its failure to support the persisting myths about 
the social and emotional development of the mathematically gifted 
and the perils of academic acceleration. Otherwise, as with almost 
any ambitious research project in psychology, it poses more ques-
tions than it answers.

LEWIS R. AIKEN


John A. R. Wilson, Mildred C. Robeck, and William B. Michael.
Psychological Foundations of Learning and Teaching. (2nd ed.)
New York: McGraw-Hill, 1974. Pp. xv & 589. $10.95 and $8.95
(paperback).


Psychological Foundations of Learning and Teaching is an out-
standing text in the field of educational psychology. These chapter
titles are indicative of its scope: “1. Motivation for better teaching,
3. Appraisal and objective of education, 4. Forming cognitive asso-
ciations, 5. Affective associations, 7. Affective conceptualizations,
Neurology of learning, 12. Development of perceptual abilities,
Cognitive growth: Piaget's theory, 15. Intelligence: structure
and function, 17. Evaluation of objectives, 18. Teacher-made tests
and scales, 19. Statistical methods.” The book concludes with a
22 page bibliography. Largely through the courtesy of the United
Nations, students of different lands are shown in learning situations.

Each chapter begins with a list of behavioral objectives which
imply the abilities to be acquired, thought questions to be answered,
and technical terms to be defined and used. For example, Chapter
6, Cognitive Conceptualizations: Grasping Inherent Relations, begins
with the following list of objectives or goals:


This chapter is planned to help you structure curriculum so that 
your students will learn relationships that are transferable to new 
situations. You will be able to: 

Observe when a student synthesizes knowledge, or associations, 
to form a conceptualization. 

Plan a series of questions that point to conceptualizations. 

Design a series of steps that lead to discovery. 

Design learning situations to provide checks against which con-
ceptualizations can be tested.

Analyze a unit of content into essential features. 

Conceptualize the following terms: 

intuition 

inquiry process 

gestalt 

insight 

intrinsic programming 

inductive learning 


Study of this text should acquaint the student with the contribu- 
tions of Benjamin Bloom, Sigmund Freud, J. P. Guilford, B. F. 
Skinner, L. L. and Thelma Thurstone, Robert Thorndike, Ralph
Tyler, and many others. Both the Cognitive and Affective Domain 
Taxonomies are carefully described and illustrated. Similarly, Guil- 
ford's model for the structure of intellect and the correlations of the 
Thurstone primary mental abilities with each other and with total 
scores are also explained and illustrated. 

Chapter 17 Evaluation of Objectives includes discussion of ed- 
ucational accountability, reliability, and the different kinds of 
validity—content, criterion related, and construct. Chapter 18 ex- 
plains the construction and use of teacher-made and standardized 
tests and scales. A number of series of exercises are quoted from a
folio of Chicago City Junior College English and General Course
Examinations. Chapter 19 Statistical Methods covers most of the 
important procedures of descriptive and inferential statistics. 

This admirable book is well worth using as a text and as a permanent
addition to one's professional library.


Max D. ENGELHART 


ERRATUM


In E. B. Page's article "Top-down" Trees of Educational Value,
which appeared in the Autumn 1974 issue, pp. 573-584, formula (2)
on p. 579 should read:


$$R = \sqrt{\sum \beta^2} = \sqrt{\sum r^2} \qquad (2)$$


Thus the variance explained, $R^2$, would be calculated by summing
the squared correlations (Page and Breen, 1973). The writer regrets
this typographical omission, which was brought to his attention by
Professor Jason Millman. The logic of the article, and of the
subsequent formulas, is unaffected by the change.
CONFIGURAL FREQUENCY ANALYSIS AS A
STATISTICAL TOOL FOR DEFINING TYPES


G. A. LIENERT AND J. KRAUTH
University of Dusseldorf


Configural frequency analysis (CFA) is a new method for identify- 
ing types. Types are defined as patterns (configurations) of binary 
variables occurring more frequently than may be expected under the 
assumption of complete independence of the respective variables, 
and are tested for significance by multiple binomial tests or suitable 
approximations. CFA is illustrated numerically by an example. Rela- 
tions to latent class analysis and to factor analysis are discussed. It is 
suggested to use CFA as a type-defining method instead of factor
analysis if the variables are linked to each other not only by first but 
also by higher-order associations. 


There are many definitions of the concept of type (see Cattell, 
Coulter, and Tsujioka, 1968), no one being satisfactory from a 
statistical point of view. In fact, up to now a type has been considered
“a statistical concept without statistics” (English and English, 1957). 

Since intuitively a type is conceived as a pattern of qualities that 
tend to occur together with high frequency (see Lorr, 1966) a type 
might tentatively be defined as a modal frequency in a discrete mul- 
tivariate distribution. 

However, modal frequencies may arise from two different sources:
(a) from frequently occurring component qualities, and (b) from
interactions which increase frequencies of the component qualities.
While source (a) is trivial in producing modal frequencies, source (b)
is suggested to define a type modal frequency as follows: A type is a
multivariate class of qualities that occur more often together than may
be expected by chance from the proportions of the respective qualities
under the assumption of their independence.

For example, extroverted smoking men define a type if and only if
they occur more often in a population than may be expected by chance
from the proportion of extroverts, from the proportion of smokers, 
and from the proportion of men, in that population. 

According to the definition above, a method will be proposed for 
identifying types statistically. Originally introduced as a heuristic 
method (Lienert, 1968), the so-called configural frequency analysis
(CFA) has recently been developed into an inferential method (Krauth
and Lienert, 1972).


Rationale of CFA 


If t binary variables (each scored + or −) are observed in a sample of
N individuals, $2^t$ binary classes or configurations $C_j$,
$j = 1, 2, \ldots, 2^t$, occur. The C's define a t-dimensional fourfold
table with observed configural frequencies, f, or configural
proportions, p, where


$$p = f/N. \qquad (1)$$


Each of the $2^t$ observed frequencies is associated with an expected
frequency e, or proportion P, where


$$P = e/N. \qquad (2)$$


Under the null hypothesis of no interaction of any of the t variables,
i.e., under the no-type hypothesis, there is


$$p = P \quad \text{for all } C\text{'s}, \qquad (3)$$


while under the alternative (one-sided) hypothesis of interaction, or
type hypothesis, there is


$$p > P \quad \text{for at least one } C. \qquad (4)$$


For testing whether the null hypothesis ($H_0$) holds, the model must
be specified for getting expected proportions.

1. According to the fixed proportions model, where the proportions
$\pi_i$ of the t variables or their binary classes are known or
postulated theoretically, $P = \pi$ is given by


$$\pi = \prod_{i=1}^{t} \pi_i. \qquad (5)$$


2. In the estimated proportions model the proportions P are
estimated from the variable proportions $f_i/N$ of the sample by


$$P = \prod_{i=1}^{t} f_i/N, \qquad (6)$$


where each $f_i$ is the marginal frequency of the class (+ or −) that
variable i takes in the configuration.


In general, model 2 is more realistic than model 1, since fixed
proportions are seldom known except in the case where standardized
test scales have been dichotomized at their population medians to give
$\pi_i = 1/2$. Therefore, in the following, only model 2 will be
considered for testing.
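
As a minimal sketch of formula (6), the following Python function derives the expected proportion of every configuration from the sample marginals. The function name, the dictionary layout, and the toy frequencies are illustrative assumptions and are not taken from the article.

```python
def estimated_proportions(observed):
    """Expected proportion P of each configuration under formula (6).

    `observed` maps a sign pattern such as "+-" to its observed frequency f.
    """
    n = sum(observed.values())
    t = len(next(iter(observed)))
    # Marginal frequency of "+" for each of the t variables.
    plus = [sum(f for pattern, f in observed.items() if pattern[i] == "+")
            for i in range(t)]
    expected = {}
    for pattern in observed:
        p = 1.0
        for i, sign in enumerate(pattern):
            # Use the "+" marginal or its complement, as the pattern requires.
            marginal = plus[i] if sign == "+" else n - plus[i]
            p *= marginal / n
        expected[pattern] = p
    return expected


# Hypothetical data for t = 2 variables and N = 100 individuals.
toy = {"++": 30, "+-": 10, "-+": 20, "--": 40}
P = estimated_proportions(toy)
print(P["++"])  # product of the two "+" marginal proportions, 0.4 * 0.5
```

Because the expected proportions are products of complementary marginals, they always sum to one over the $2^t$ configurations.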


Binomial Testing for Types 


When associated with an expected proportion P, the probability under
$H_0$ that a specified configuration C has an observed frequency
greater than or equal to f is given by


$$\mathrm{Prob}(C) = \sum_{i=f}^{N} \binom{N}{i} P^i (1 - P)^{N-i}. \qquad (7)$$


A configuration C is assumed to be a type if Prob(C) < $\alpha$, where
$\alpha$ is the significance level agreed upon prior to sampling, and
valid for only one test.

If, as usual in heuristic exploration, $2 \le r \le 2^t$ configurations
are tested for types, $\alpha$ must be adjusted for multiple testing. As
Krauth and Lienert (1972) have shown, adjustment is most simply made by
setting


$$\alpha^* = \alpha/r. \qquad (8)$$


Of course, specified testing of selected r out of $2^t$ configurations
has to be justified prior to sampling, while unspecified testing of all
$2^t$ configurations requires no such justification.
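
Formulas (7) and (8) can be sketched in a few lines of Python using only the standard library. The function names are assumptions made for illustration; the numerical inputs (f = 12, N = 65, P = .07696, $\alpha$ = .10, r = 16) are the ones the article reports for its LSD example.

```python
from math import comb


def binomial_upper_tail(f, n, p):
    """Formula (7): probability of f or more occurrences out of n under H0."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(f, n + 1))


def adjusted_alpha(alpha, r):
    """Formula (8): significance level for each of r simultaneous tests."""
    return alpha / r


# Configuration ++++ of the LSD example: f = 12, N = 65, P = .07696.
prob = binomial_upper_tail(12, 65, 0.07696)
alpha_star = adjusted_alpha(0.10, 16)
is_type = prob < alpha_star
print(round(prob, 5), alpha_star, is_type)
```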


Approximate Testing for Types 


1. If N is large and P is not too small, normal approximation to the
binomial may be made by setting


$$z(C) = (f - NP - .5)/\sqrt{NP(1 - P)}. \qquad (9)$$


$H_0$ is rejected if $z(C) > z^*$, where $z^*$ is the unit normal
deviate associated with $\alpha^*$.

2. Instead of the normal approximation, provided NP > 5 for all
configurations to be tested, the chi-square approximation


$$X^2(C) = (f - NP)^2/NP, \quad df = 1 \qquad (10)$$


may be used. However, since chi-square is sensitive to the alternative
p > P as well as to the alternative p < P, $X^2(C)$ tests for types as
well as for "antitypes" if $\chi^2(\alpha^*)$ is set as the critical
limit. For testing against p > P only, $\chi^2(2\alpha^*)$ is the correct
limit. For unconventional $\alpha^*$'s, use the relation
$\chi^2 = z^2$, valid for df = 1.

3. If N is very large and P is very small, Poisson approximation to
the binomial is most suitable. Further approximations are reviewed by
Molenaar (1970).

4. In every case the binomial fractiles may be calculated through the
F-distribution by using the Camp-Paulson approximation (see
Molenaar, 1970)


$$F(C) = \frac{(f + 1)(1 - P)}{(N - f)P}.$$


F(C) is to be evaluated for $df_1 = 2(N - f)$ and $df_2 = 2(f + 1)$ with
$F(\alpha^*)$ as a critical limit. For unconventional $\alpha^*$'s, the
critical limit may be obtained by Paulson's normal transformation of
the F-distribution (see Kendall and Stuart, 1969).
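
The normal and chi-square approximations of formulas (9) and (10) can be sketched as follows. The function names are illustrative assumptions; the critical values are obtained from the standard library's `statistics.NormalDist`, using the relation $\chi^2 = z^2$ for df = 1 so that the one-sided chi-square limit $\chi^2(2\alpha^*)$ equals the square of $z^*$.

```python
from math import sqrt
from statistics import NormalDist


def z_statistic(f, n, p):
    """Formula (9): continuity-corrected normal deviate."""
    return (f - n * p - 0.5) / sqrt(n * p * (1 - p))


def chi_square_statistic(f, n, p):
    """Formula (10): chi-square component with df = 1."""
    return (f - n * p) ** 2 / (n * p)


def critical_values(alpha_star):
    """Return z* and the one-sided chi-square limit chi2(2 * alpha*) = z*^2."""
    z_star = NormalDist().inv_cdf(1 - alpha_star)
    return z_star, z_star**2


# Configuration ++++ of the LSD example: f = 12, N = 65, P = .07696.
z = z_statistic(12, 65, 0.07696)
x2 = chi_square_statistic(12, 65, 0.07696)
z_star, x2_star = critical_values(0.10 / 16)
print(round(z, 3), round(x2, 2), z > z_star, x2 > x2_star)
```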


An Example from Experimental Psychopathology


N = 65 volunteer subjects (Ss) were rated for occurrence (+) or
nonoccurrence (−) of the symptoms H = Hallucinations, B = Black-outs,
T = Thinking disturbances, and A = Affective reactions, under lysergic
acid diethylamide (LSD). The configural frequencies f of the t = 4
binary symptoms are given in Table 1 (data from Lienert, 1970).
1. The estimations of the expected proportions, P, in Table 1 were
calculated (a) by counting the four pairs of one-dimensional marginals


TABLE 1

CFA of t = 4 binary symptoms H = Hallucinations, B = Black-outs,
T = Thinking disturbances, and A = Affective reactions of N = 65
volunteer Ss under influence of lysergic acid diethylamide.

HBTA     f      P        Prob(C)      z       Prob(z)
++++    12    .07696     .00371     3.024     .00124
+++-     0    .04214    1.00000    -2.000     .97725
++-+     1    .07017     .99117    -1.972     .97570
++--     4    .03842     .24969      .647     .25872
+-++     1    .05824     .97976    -1.740     .95907
+-+-     3    .03189     .34333      .302     .38133
+--+     5    .05310     .26253      .580     .28096
+---     0    .02908    1.00000    -1.764     .96113
-+++     8    .11544     .48038     -.001     .50040
-++-     1    .06322     .98566    -1.840     .96712
-+-+     3    .10525     .97285    -1.755     .96037
-+--     8    .05764     .03286     1.998     .02286
--++     2    .08736     .98103    -1.835     .96675
--+-     7    .04784     .03542     1.970     .02442
---+    10    .07965     .03231     1.980     .02385
----     0    .04362    1.00000    -2.025     .97857
using point index notation


$$f_{+...} = 12+0+1+4+1+3+5+0 = 26, \quad f_{-...} = 65-26 = 39$$
$$f_{.+..} = 12+0+1+4+8+1+3+8 = 37, \quad f_{.-..} = 65-37 = 28$$
$$f_{..+.} = 12+0+1+3+8+1+2+7 = 34, \quad f_{..-.} = 65-34 = 31$$
$$f_{...+} = 12+1+1+5+8+3+2+10 = 42, \quad f_{...-} = 65-42 = 23$$


and (b) by getting the 16 factor products of the respective marginals
and dividing them by $65^4$:


$$P_{++++} = 26 \cdot 37 \cdot 34 \cdot 42/65^4 = .07696$$
$$P_{+++-} = 26 \cdot 37 \cdot 34 \cdot 23/65^4 = .04214$$


2. The binomial probabilities, Prob, in Table 1, were obtained by
inserting the f's and the P's in formula 7. Thus, for example


$$\mathrm{Prob}(++++) = \binom{65}{12}(.07696)^{12}(1 - .07696)^{53} + \binom{65}{13}(.07696)^{13}(1 - .07696)^{52} + \cdots + \binom{65}{65}(.07696)^{65}(1 - .07696)^{0} = .00371.$$


3. Having no information prior to sampling, all 16 configurations
are to be tested for types. Assuming $\alpha$ = .10, there is
$\alpha^*$ = .10/16 = .00625. Looking at the Prob's of Table 1, it may
be concluded that the only type-defining configuration is H+ B+ T+ A+.

4. The syndrome defined by hallucinations, black-outs, thinking
disturbances and affective reactions may be fairly interpreted as the
"psychotoxic basis syndrome" described by Leuner (1962) as
characteristic for LSD reaction in normal Ss and neurotic patients.

5. In order to compare exact Prob's with Prob's obtained by normal
approximation, z(C) and Prob(z) have been calculated in Table 1 in
addition. Agreement is satisfactory, not influencing any test result.
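
The five steps of the example can be retraced with a short Python sketch. The frequencies are those of Table 1; the variable and function names are illustrative assumptions, and the code applies the estimated proportions model, the exact binomial test of formula (7), and the adjusted level $\alpha^*$ = .10/16.

```python
from itertools import product
from math import comb, prod

# Sign patterns in the order H, B, T, A, with the frequencies of Table 1.
PATTERNS = ["".join(signs) for signs in product("+-", repeat=4)]
FREQS = [12, 0, 1, 4, 1, 3, 5, 0, 8, 1, 3, 8, 2, 7, 10, 0]
observed = dict(zip(PATTERNS, FREQS))
N = sum(FREQS)

# One-dimensional marginals: how often each symptom was rated "+".
plus = [sum(f for pat, f in observed.items() if pat[i] == "+") for i in range(4)]


def expected_proportion(pattern):
    """Product of the marginal proportions matching the pattern (formula 6)."""
    return prod((plus[i] if s == "+" else N - plus[i]) / N
                for i, s in enumerate(pattern))


def upper_tail(f, n, p):
    """Exact binomial probability of f or more occurrences (formula 7)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(f, n + 1))


alpha_star = 0.10 / len(PATTERNS)
probs = {pat: upper_tail(f, N, expected_proportion(pat))
         for pat, f in observed.items()}
types = [pat for pat, pr in probs.items() if pr < alpha_star]
print(types)  # ['++++']
```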


Hierarchical CFA 


Thus far, CFA has been looked at as an inferential method. Looking
at CFA as a heuristic method only, one may use it to answer the
question whether a subgroup of s < t binary variables gives more
clear-cut configural types than the total group of t variables. For
selecting s out of t variables, CFA may be performed hierarchically.
Referring to the example on LSD, the instructions are:

1. Perform a CFA in Table 1 with four symptoms and evaluate
$\sum z^2$ like a chi-square for $2^4 - 4 - 1 = 11$ df, the z's being
the unit normal deviates associated with the Prob's.

2. Perform CFA's of all $\binom{4}{3} = 4$ triplet combinations of
symptoms (HBT, HBA, HTA, BTA) and evaluate each $\sum z^2$ for
$2^3 - 3 - 1 = 4$ df.

3. Finally perform CFA's of all $\binom{4}{2} = 6$ doublet combinations
of symptoms (HB, HT, HA, BT, BA, TA) and evaluate each $\sum z^2$ for
$2^2 - 2 - 1 = 1$ df.

4. Now select that CFA out of all 11 CFA's which gives the "most
significant" $\sum z^2$.

Proceeding that way with the symptoms in Table 1, the triplet BTA
gives the "most significant" chi-square, and the most clear-cut
syndrome type B+ T+ A+. In general, hierarchical CFA is a means for
eliminating variables irrelevant in defining configural types.
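
The bookkeeping of the four instructions can be sketched in Python as below. This is only an illustration under a stated assumption: the z's are taken from the normal approximation of formula (9) rather than from the exact Prob's, because a Prob of exactly 1 (a configuration with f = 0) has no finite normal deviate. The names `collapse` and `sum_z_squared` are hypothetical. The sketch returns $\sum z^2$ and its df for each of the 11 CFA's and leaves the comparison of significance levels to a chi-square table.

```python
from itertools import combinations, product
from math import prod, sqrt

SYMPTOMS = "HBTA"
FREQS = [12, 0, 1, 4, 1, 3, 5, 0, 8, 1, 3, 8, 2, 7, 10, 0]
observed = dict(zip(("".join(s) for s in product("+-", repeat=4)), FREQS))
N = sum(FREQS)


def collapse(subset):
    """Sum the Table 1 frequencies over the symptoms not named in `subset`."""
    idx = [SYMPTOMS.index(name) for name in subset]
    table = {}
    for pattern, f in observed.items():
        key = "".join(pattern[i] for i in idx)
        table[key] = table.get(key, 0) + f
    return table


def sum_z_squared(table):
    """Return (sum of z^2, df) for one CFA, with z from formula (9)."""
    s = len(next(iter(table)))
    plus = [sum(f for pat, f in table.items() if pat[i] == "+") for i in range(s)]
    total = 0.0
    for pat, f in table.items():
        p = prod((plus[i] if sign == "+" else N - plus[i]) / N
                 for i, sign in enumerate(pat))
        z = (f - N * p - 0.5) / sqrt(N * p * (1 - p))
        total += z * z
    return total, 2**s - s - 1


results = {}
for size in (4, 3, 2):
    for subset in combinations(SYMPTOMS, size):
        name = "".join(subset)
        results[name] = sum_z_squared(collapse(name))

for name, (stat, df) in results.items():
    print(name, round(stat, 2), df)
```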


CFA and Related Type-defining Methods 


CFA is formally and substantially linked to Lazarsfeld's Latent 
Class Analysis (LCA) and to the well known factor analysis (FA). 

1. LCA starts, as CFA, from a pattern of binary variables (see 
Lazarsfeld and Henry, 1968, or Cassady, Miller, and Dingman, 1968) 
but arrives, opposite to CFA, at types of statistically independent 
variables similar to modal frequencies of source (a) mentioned initial- 
ly. Furthermore, LCA types are conceived as points along a one- 
dimensional continuous space while CFA types are points within a t- 
dimensional binary space. Thus, LCA and CFA are incompatible 
type-defining methods, though neither method is restricted to first- 
order associations between variables. 

2. FA is primarily related to hierarchical CFA in so far as both 
methods reduce the number of variables to those relevant for defining 
types or factors. FA differs from CFA in that FA relies only on first 
order associations between binary variables while CFA relies on first 
and higher-order associations. Thus FA and CFA give comparable 
results only if there are no higher-order associations in binary 
variables or hypernonlinear correlations in (dichotomized) continuous 
variables. Both LCA and FA differ from CFA in that (a) CFA is a
type-defining method giving unique solutions even as a heuristic
procedure, (b) CFA is, unlike LCA and FA, an inferential method and
(c) CFA is completely nonparametric while FA is parametric and LCA
has parametric implications. The only disadvantage of CFA is that the
sample size is required to increase exponentially as the number of
variables increases linearly.




Conclusions 


CFA is suggested as a type-defining method primarily for variables 
related to each other not only by first-order associations, but also by 
second-order and higher-order associations. In case of the LSD exam- 
ple, Black-outs, Thinking disturbances and Affective reactions are 
linked to each other by a second-order association (see Goodman, 
1964), and the type B+ T+ A+ could never have been isolated by
means of FA or any other method of intercorrelation analysis. 

Higher-order associations and hypernonlinear correlations seem to 
occur mostly in psychopathology as has been shown for depressive 
symptoms (Lienert, Angst, Baumann, Gebert, 1973) and for aphasic 
test scales (Gloning, Lienert, Quatember, 1972). It is, therefore, sug- 
gested that clinical syndromes and types of personality disorders may 
be re-examined by a CFA of their symptoms or traits, especially if they 
have failed to be identified conclusively and consistently by other type- 
defining methods. 

CFA is suggested as the only valid type-defining method, if the 
variables, or some of them, are scaled nominally and multinary. CFA 
of multinary variables is as straightforward as CFA of binary 
variables, if binomial tests or their chi-square approximations are used 
to compare observed with expected configural frequencies. Of course,
the number of configurations possible will be enlarged in case of mul- 
tinary variables thus requiring a larger sample size for efficient 
binomial testing. 
METHOD FOR HIERARCHICAL CLUSTERING OF
A MATRIX OF A THOUSAND BY A THOUSAND


LOUIS L. MCQUITTY AND VALERIE L. KOCH
University of Miami, Coral Gables


Most hierarchical clustering methods are limited to relatively
small matrices. On the other hand, if personality types exist, there are
probably many of them. Consequently, a method is needed for
clustering hierarchically the interrelationships between many
persons, as represented in a matrix of a thousand by a thousand. This
paper develops and illustrates such a method.


At least two previous studies indicate that if personality types exist,
there are many of them (McQuitty, 1954 and 1957). If large numbers
of types do in fact exist, studies based on sample sizes which do not
adequately represent them could fail to yield them just because they
are based on too few cases.

Investigations of whether or not personality types exist would be
facilitated by a method which clusters hierarchically a matrix
reporting interassociations between a thousand or more persons. This
paper develops one such method. The method is both concise and rapid.


Method 


Definitions of Types


The method of this paper is developed out of two definitions of
types: (1) reciprocal, dyadic types and (2) higher order types. A
reciprocal, dyadic type is defined in relation to a set of objects; it
includes only two objects, $O_i$ and $O_j$; Object $O_i$ is most like
Object $O_j$, and Object $O_j$ is in turn most like Object $O_i$, i.e.,
amongst the objects of the set.

Dyadic types (themselves reciprocal pairs) are associated by
reciprocal pairs into higher order types. In brief, this is accomplished
by dropping temporarily one member of each dyadic type and thus
allowing the remaining members to form new reciprocal pairs. When
the members of the new reciprocal pairs are clustered they take with
them into the higher order types all of the objects with which they
had been previously clustered.

Higher order types are clusters of objects associated by dyadic types.
If Object $O_1$ is reciprocal with Object $O_2$, which is reciprocal
with Object $O_3$, which is reciprocal with ... which is reciprocal with
Object $O_N$, then all N objects belong to the same higher-order type.
Furthermore, if any Object $O_i$ in the latter chain is reciprocal with
some other Object $O_1'$, which is reciprocal with $O_2'$, which is
reciprocal with $O_3'$, which is reciprocal with ... which is reciprocal
with $O_{N'}'$, then the N + N' objects belong to the same higher-order
type, and analogously for all other like appendages.


Versions of the Method 


Three versions of the method are described and compared
empirically: (1) Concentrated Clustering, (2) Dispersed Clustering,
and (3) Median Clustering. Concentrated Clustering is the recommended
version and is outlined first, followed in turn by Dispersed and Median
Clustering.


Concentrated Clustering 


... problem of ties, and (2) to illustrate its effectiveness even when
the classifications depend on small differences in interassociations
between objects.

The matrix is shown in Table 1. The highest entry in every column is
underlined. The second highest entry in some columns is also
underlined. This fact can be ignored for the time being.

Each of the highest entries is examined to determine whether or not
it is reciprocal. An entry in any column, i, and row, j, is reciprocal if,
and only if, it is highest in both Columns i and j. The highest entry in
Column A of Table 1 is 30, and it is with Row P, but the highest entry
in Column A is not reciprocal; Object A is not highest in Column P.
The highest entry in Column A is for the time being ignored.
Analogously, the highest entry in Column B is not reciprocal, and it is
also ignored for the time being.

The highest entry in Column C is 34 and occurs in both Rows F and
P. It is also highest in both Columns F and P; it is reciprocal between
C and F and between C and P. Column C is redesignated $i_1$, and
Column F is redesignated $j_1$. Then Column C is redesignated $i_1'$,
and Column P is redesignated $j_1'$.
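
The reciprocity test, including the case of a tied highest entry, can be sketched as follows. The helper names and the four-object matrix are hypothetical; only the C-F and C-P scores of 34 echo the article, and the remaining scores are invented for illustration.

```python
def symmetric(scores):
    """Build a symmetric lookup sim[i][j] from scores keyed by object pairs."""
    sim = {}
    for (i, j), s in scores.items():
        sim.setdefault(i, {})[j] = s
        sim.setdefault(j, {})[i] = s
    return sim


def reciprocal_pairs(sim, objects):
    """Pairs whose score is the highest entry in both of their columns."""
    best = {c: max(sim[r][c] for r in objects if r != c) for c in objects}
    pairs = []
    for a, i in enumerate(objects):
        for j in objects[a + 1:]:
            # A tied maximum still counts, so one object may enter two pairs.
            if sim[i][j] == best[i] == best[j]:
                pairs.append((i, j, sim[i][j]))
    return pairs


# Hypothetical scores; X is an invented fourth object.
sim = symmetric({("C", "F"): 34, ("C", "P"): 34, ("F", "P"): 30,
                 ("C", "X"): 10, ("F", "X"): 12, ("P", "X"): 11})
print(reciprocal_pairs(sim, ["C", "F", "P", "X"]))
# [('C', 'F', 34), ('C', 'P', 34)]
```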

The next reciprocal pair, from left to right, in the matrix is in
Objects D and L. Column D is redesignated $i_1''$ and Column L is
redesignated $j_1''$. The only other reciprocal pair in the matrix at this

TABLE 1
Illustrating the Initial Analysis of Concentrated Hierarchical Clustering

(The body of the table is a 20 x 20 matrix for Objects A through T,
with a row of redesignations above it, the number of underlines to the
left of each row, and the random code numbers in the last row; its
entries are not legible in this copy.)

Note. The basic table is from McQuitty, Price and Clark, 1967.
stage is for Objects J and T. Column J is redesignated $i_1'''$ and
Column T is redesignated $j_1'''$.

Four clusters, of two objects each, have now been isolated: (1) C F,
(2) C P, (3) D L, and (4) J T, as indicated by the redesignations of their
columns in the table. The next problem is to decide which member of 
each cluster better represents the cluster for the next stage of the 
analysis. That member of a pair which participates in more highest 
column entries is assumed to be the better representative. It is 
recognized by the fact that it has more underlines in its row. 

If a tie occurs, the selection is made randomly. This was
accomplished in this study by assigning randomly the numbers 1 to 20 to
the code letters of the objects, A through T, using a table of random
numbers. Whenever a tie occurred, the object with the larger code 
number was chosen. The randomly assigned code numbers are shown 
in the last row of Table 1. 

The number of underlines in Row C was five at this stage of the 
analysis, as recorded to the left of Row C in the table. Other underlines 
are added in the course of the analysis, and it is for this reason that 
Row C has six underlines. The number of underlines in Row F is four. 
Object C is by assumption a better representative of its cluster than is
F. Row F is marked out and eliminated from the rest of the analysis.
Analogously, Object C is a better representative than P, and J is a
better representative than T. Rows P and T are marked out and
eliminated from the rest of the analysis. The members of the other 
cluster, D and L, are tied with one underline each. Object L was 
chosen because it has the larger code number, and Object D was 
eliminated; Row D was marked out. Consistently with having 
eliminated certain rows, corresponding columns were marked out and 
eliminated from the rest of the analysis. This completed Step 1. 

In Step 2, the highest entries in the columns of the reduced matrix 
are underlined, as shown in the table. For example, in Column A, the 
highest entry for the original matrix was 30 in Row P, but Row P was 
eliminated in Step 1. The highest entry in Column A of the reduced 
matrix is 29; it appears in Rows C and I. The highest entries in each of
the other columns are as shown by the underlines.

Every subsequent step is a repetition of Step 1, except that each of
them is applied to a matrix reduced by the operations of the previous
step. Each step reduces the matrix by one or more corresponding rows
and columns, as outlined above in Step 1.
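
The repeated step can be sketched as a loop, shown here on a small hypothetical matrix of four objects with hypothetical random code numbers. The function names are assumptions, and the sketch does not treat the special case in which one object would be retained in one pair and eliminated in another pair of the same step.

```python
def symmetric(scores):
    """Build a symmetric lookup sim[i][j] from scores keyed by object pairs."""
    sim = {}
    for (i, j), s in scores.items():
        sim.setdefault(i, {})[j] = s
        sim.setdefault(j, {})[i] = s
    return sim


def underline_counts(sim, active):
    """Number of columns in which each row holds a highest entry."""
    counts = {r: 0 for r in active}
    for c in active:
        top = max(sim[r][c] for r in active if r != c)
        for r in active:
            if r != c and sim[r][c] == top:
                counts[r] += 1
    return counts


def concentrated_clustering(sim, objects, code):
    """Return the merges as (step, retained, eliminated, score) tuples."""
    active = list(objects)
    merges = []
    step = 0
    while len(active) > 1:
        step += 1
        best = {c: max(sim[r][c] for r in active if r != c) for c in active}
        counts = underline_counts(sim, active)
        dropped = set()
        for a, i in enumerate(active):
            for j in active[a + 1:]:
                if sim[i][j] == best[i] == best[j]:  # reciprocal pair
                    # Retain the member with more underlines; on a tie,
                    # retain the one with the larger random code number.
                    keep, drop = sorted(
                        (i, j), key=lambda o: (counts[o], code[o]), reverse=True)
                    merges.append((step, keep, drop, sim[i][j]))
                    dropped.add(drop)
        active = [o for o in active if o not in dropped]
    return merges


# Hypothetical data, not taken from Table 1.
sim = symmetric({("A", "B"): 9, ("A", "C"): 5, ("A", "D"): 3,
                 ("B", "C"): 7, ("B", "D"): 2, ("C", "D"): 4})
code = {"A": 3, "B": 1, "C": 4, "D": 2}
merges = concentrated_clustering(sim, ["A", "B", "C", "D"], code)
print(merges)
# [(1, 'B', 'A', 9), (2, 'C', 'B', 7), (3, 'C', 'D', 4)]
```

The loop always terminates, because the largest entry of any reduced matrix is highest in both of its columns and so yields at least one reciprocal pair per step.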

Showing any further analysis in Table 1 would have made the table
difficult to read in following the description up to this point. The
complete operations of Concentrated Clustering are shown in Table 2,
using the symbols applied in outlining Step 1. The row of Step 3, for
example, shows that C (redesignated $i_3$) joined Q (redesignated
$j_3$) and that J (redesignated $i_3'$) joined O (redesignated $j_3'$).
The column of Step 3 shows that Row C at that stage of the analysis had
four underlines and that Q had one underline. Column and Row Q were
marked out in Step 3 (as shown by the line through Row Q extending into
the Column of Step 3) because Row Q had fewer underlines than Row C.
The column of Step 3 shows that J with two underlines tied O, also
with two underlines. Consequently, one of the two objects was
eliminated randomly (by random numbers in this case), viz., the one
with the lower code number. Object J with a code number of 16 was
retained and Object O with a code number of seven was eliminated.
The rest of the analysis continued in the same fashion as just outlined.

The hierarchical structure which derives from the analysis (Figure 1)
can be prepared directly from Table 2, but the operation is simplified
and facilitated by an intermediate table, especially so if the matrix is
large and diffuse. The intermediate table orders the data according to
the size of the reciprocal pairs, which yield the classification, as shown
in Table 3. In preparing Table 3, the reciprocal pairs of Table 2 were
first ordered numerically from the largest down to the smallest scores

Figure 1. Concentrated Hierarchical Clustering of the Data of Tables 1 and 2.
(Vertical axis: scores at which classifications occurred; horizontal axis: objects.)
TABLE 3
Scores at Which Classifications Occurred

Scores    Steps, Redesignations & Reciprocal Pairs
34        1 CF, 1' CP
33        2 CI
32        3 CQ
31        1''' JT, 2'' KS, 4' CN
30        2' GJ, 5 CK
29        6 AC
28        3' JO
27        1'' DL, 4'' JR, 7 CM
26        4 BL, 8 CE
25
24
23
22
21        5' LR
20
19        9 CR
18
17
16
15        10 HR


(Column 1). The largest score of a matrix is always involved in the first 
step; it is reciprocal from the start of the analysis by virtue of the fact 
that it is the largest score in the matrix and therefore largest in each of 
the two columns in which it occurs. 

Step 1 of Table 2 shows that Object C joined F with the largest score
of 34, and under a redesignation of 1, and that C also joined P with the
largest score of 34, and under a redesignation of 1'. These two
classifications are shown as the first entry of Table 3. Step 1 of Table 2
shows also that Object D joined L with a score of 27 and under a
redesignation of 1''. This outcome is shown by 1'' DL after 27 in
Table 3. Step 1 of Table 2 shows further that Object J joined T with a
score of 31 and under a redesignation of 1'''. This outcome is shown by
1''' JT opposite the score of 31 in Table 3. The classifications of the
other steps of Table 2 were recorded in Table 3 analogously.

The structure of Figure 1 was prepared from Table 3, working from
the largest score down and from left to right within every row of
Table 3. The top score, 34, of the table shows that C joined F under
this score and with a redesignation of 1, and C joined P under this
score and with a redesignation of 1'. These three objects are plotted
accordingly in Figure 1 with a 1 under each C and F and a 1' under each
C and P.

Under a score of 33, C joined I with a redesignation of 2. Under a
score of 32, C joined Q with a redesignation of 3. These latter two
classifications are shown in Figure 1, with a 2 under each C and I and a
3 under each C and Q. This method of plotting associates the
hierarchical structure with the steps of the analysis as reported in
Table 2.

Whenever an object of a cluster classifies with another object, as in 
most of the above examples, it takes with it into the new cluster all of 
the objects with which it is already classified. 

Up to this point the plotting was straightforward and simple. But
under a score of 31, J and T joined each other, as did also K and S. 
None of these objects had already joined the initial cluster, outlined 
above. It was helpful to start each of them on a separate sheet and to 
introduce them into the initial cluster (Figure 1) only when they joined 
it by an association with some member of the initial cluster. 

Cluster KS soon joined the initial cluster under a score of 30 
between C (of cluster CFPIQN) and K (of Cluster KS). Accordingly, 
the KS cluster was at this time attached to the initial cluster of Figure 
1. However, Cluster JT first expanded on its separate sheet into 
Cluster JTGOR and then joined a cluster, DLB, which had been 
started on another separate sheet. These two clusters classified 
together under a score of 21 between L and R. Cluster DLB was at this 
time attached to Cluster JTGOR on the sheet on which the latter 
cluster was initiated. The resultant cluster, JTGORDLB then joined 
the initial cluster under a score of 19 between C and R and was at this 
time attached to the initial cluster of Figure 1. The final classification 
joined Object H with the initial cluster under a score of 15 between H 
and R, and H was attached to the initial cluster. 
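
The way clusters grow and merge as the scores decrease can be retraced with a small union-find sketch over the reciprocal pairs of Table 3. The function name and data layout are illustrative assumptions; the pairs and scores are those listed in Table 3.

```python
# Reciprocal pairs of Table 3, ordered from the largest score down.
PAIRS = [
    (34, "CF"), (34, "CP"), (33, "CI"), (32, "CQ"),
    (31, "JT"), (31, "KS"), (31, "CN"), (30, "GJ"), (30, "CK"),
    (29, "AC"), (28, "JO"), (27, "DL"), (27, "JR"), (27, "CM"),
    (26, "BL"), (26, "CE"), (21, "LR"), (19, "CR"), (15, "HR"),
]


def clusters_down_to(threshold):
    """Clusters formed by all pairs with a score of at least `threshold`."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for score, (a, b) in PAIRS:
        if score >= threshold:
            # Each object carries its whole cluster into the new cluster.
            parent[find(a)] = find(b)
    groups = {}
    for x in list(parent):
        groups.setdefault(find(x), set()).add(x)
    return sorted(("".join(sorted(g)) for g in groups.values()),
                  key=len, reverse=True)


mean_score = sum(score for score, _ in PAIRS) / len(PAIRS)
print(clusters_down_to(26))  # ['ACEFIKMNPQS', 'GJORT', 'BDL']
print(clusters_down_to(15))  # ['ABCDEFGHIJKLMNOPQRST']
print(mean_score)            # 531 / 19
```

At a threshold of 26 the three clusters correspond to the initial cluster with KS attached, Cluster JTGOR, and Cluster DLB, while Object H has not yet been classified; the mean of the 19 scores agrees with the value of 27.94 that the article reports for the Concentrated version.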

Within every cluster, the objects were classified together from left to
right as the classification scores decreased. For example, Objects J and
T, with the highest score, 31, of the second cluster are shown to the
extreme left of their cluster and are then followed from left to right by
G, O, and R, which joined with scores of 30, 28, and 27 respectively.
Analogously, the clusters were arranged from left to right as the size of 
the scores which attached them to the initial cluster decreased. 


Dispersed and Median Clustering 


Dispersed Clustering differs from Concentrated Clustering in only
one respect. When two objects are reciprocal, Concentrated Clustering
eliminates from further analysis the one which has the fewer
underlines in its row. By way of contrast, Dispersed Clustering
eliminates the object which has the more underlines in its row. In
between these two extremes, Median Clustering alternates; the object
with the greater number of underlines is eliminated in half of the pairs
and the one with the fewer underlines is eliminated in the other half. In
the case of a tie, all three methods eliminate one object of the
reciprocal pair randomly.

With a large matrix, Median Clustering makes all choices for 
elimination randomly, because this saves computationally and yields 
approximately the same number of the two kinds of selections. In the 
current (small sample) application of Median Clustering, the first
choice of a non-tied pair was made randomly (selecting from the pair
the object with the larger code number). This approach selected
Object F (with Code 13
and four underlines—Table 1) over Object C (Code 12 and five under- 
lines). This outcome required the next choice for a non-tied pair to 
be the object with the greater number of underlines. Thereafter, the 
selection in non-tied pairs alternated in terms of the number of 
underlines. 

Hierarchical structures for Dispersed and Median Clustering of the 
data of this study are shown in Figures 2 and 3 respectively. 
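
The three elimination rules can be put side by side in a short sketch. The function name and the `pair_index` argument are assumptions; in particular, letting Median Clustering start by eliminating the object with more underlines mirrors the outcome described above for the small-sample application and is not a general rule stated by the article. The underline counts and code numbers for C, F, J and O are those given in the text.

```python
def choose_eliminated(i, j, counts, code, version, pair_index=0):
    """Return the member of the reciprocal pair (i, j) to eliminate."""
    if counts[i] == counts[j]:
        # Tie: the object with the larger random code number is retained.
        return i if code[i] < code[j] else j
    fewer, more = (i, j) if counts[i] < counts[j] else (j, i)
    if version == "concentrated":
        return fewer
    if version == "dispersed":
        return more
    if version == "median":
        # Alternate between the two rules over successive non-tied pairs.
        return more if pair_index % 2 == 0 else fewer
    raise ValueError(f"unknown version: {version}")


counts = {"C": 5, "F": 4, "J": 2, "O": 2}
code = {"C": 12, "F": 13, "J": 16, "O": 7}
print(choose_eliminated("C", "F", counts, code, "concentrated"))  # F
print(choose_eliminated("C", "F", counts, code, "dispersed"))     # C
print(choose_eliminated("J", "O", counts, code, "concentrated"))  # O
```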


A Comparison of the Results from the Three Versions 


Figures 1, 2 and 3 reveal that Concentrated Clustering initiated the
fewest clusters by pairs (5: C-F, C-P, K-S, J-T and D-L) and is
followed in order by Median (6: C-F, C-P, K-S, J-T, D-L and B-O) and
Dispersed (8: C-F, C-P, A-M, A-Q, B-O, K-S, J-T and D-L).
Entries at the bottom of the hierarchical structure show that
Concentrated, Median, and Dispersed versions required 10, 10 and 9
steps respectively to complete the classification of the 20 objects.

Figure 2. Dispersed Hierarchical Clustering of the Data of Tables 1 and 2.
(Vertical axis: scores at which classifications occurred; horizontal axis: objects.)

Figure 3. Median Hierarchical Clustering of the Data of Tables 1 and 2.
(Vertical axis: scores at which classifications occurred; the steps are listed below the objects.)
Table 4 compares the versions in terms of the scores at which the
classifications occurred. All three versions required exactly two
classifications at 34, at least one at 31 and at least one at 27. These
requirements are due to the fact that all four of these classifications occurred


TABLE 4
Frequencies and Accumulated Frequencies of Scores at Which Classifications Occurred

(Rows: Concentrated, Median, and Dispersed versions, each with its mean
score; columns: scores from 34 downward. The cell entries are not
legible in this copy. Mean scores: Concentrated 27.94, Median 27.63,
Dispersed 27.00.)
in the first step, and all three versions are identical through the first
step. Table 4 shows that the Concentrated version obtained larger ac- 
cumulated frequency of classification than the Median version for 
scores 32 through 29 and 27 through 25, equal at 34, 33, 28, 24, 23, 21 
through 19 and at 15, and smaller at 22 and 18 through 16. The Me- 
dian version obtained larger accumulated frequencies than the 
Dispersed version for scores 29 through 27, 25, 22 and 19 through 14 
and equal at 34 through 30, 26, 24, 23, 21, 20 and 13. The mean score 
at which the Concentrated version classified the 20 objects is 27.94, fol- 
lowed by Median with a mean of 27.63 and by Dispersed with a mean 
of 27.00. 

All of the above findings are consistent with the fact that the con-
centrated version retains for further analysis from each reciprocal pair 
the object which has the more scores highest with other objects (the 
more underlines in its row). This object has a relatively good chance of 
entering soon into another reciprocal pair and thus of attaching 
another object to its cluster (rather than initiating a new cluster). At 
the same time, the selected object tends to reduce the number of 
reciprocal pairs which can be realized in the next few subsequent steps; 
only one of the objects with which it is highest can form a reciprocal 
pair with it in any one step (without a tie) and by virtue of having it 
highest with each of them, the associated objects can not form a 
reciprocal pair with any other object. : 

By way of contrast with the above conditions, the other two versions 
disperse the underlined entries through more rows. This allows for 
more reciprocal pairs to occur in one or more steps and for the total 
number of steps to be reduced. 
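
The notion of a reciprocal pair used throughout this comparison can be sketched in a few lines of Python. This is only an illustration, not the authors' program: the function names and the small agreement-score table are hypothetical (only the two scores of 34 for C-F and C-P echo the text), and ties are handled by letting an object count as "highest with" every object that shares its top score.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts from {(a, b): score}."""
    sim = {}
    for (a, b), score in pairs.items():
        sim.setdefault(a, {})[b] = score
        sim.setdefault(b, {})[a] = score
    return sim


def reciprocal_pairs(sim):
    """Return pairs (a, b) in which each object is highest with the other."""
    labels = sorted(sim)
    best = {}
    for a in labels:
        top = max(sim[a].values())
        # Assumption: with a tie, every object at the top score counts.
        best[a] = {b for b, s in sim[a].items() if s == top}
    return [(a, b)
            for i, a in enumerate(labels)
            for b in labels[i + 1:]
            if b in best[a] and a in best[b]]


# Hypothetical agreement scores for five objects.
scores = symmetric({
    ("C", "F"): 34, ("C", "P"): 34, ("F", "P"): 30,
    ("C", "D"): 12, ("F", "D"): 10, ("P", "D"): 11,
    ("D", "L"): 31, ("C", "L"): 9, ("F", "L"): 8, ("P", "L"): 7,
})
pairs = reciprocal_pairs(scores)
print(pairs)
```

In this toy table C is tied at 34 with both F and P, so both C-F and C-P are realized in the same step, which is the tie case the paragraph above sets aside.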


A Comparison with Reliable and Valid
Hierarchical Classification 


Reliable and Valid Hierarchical Classification (McQuitty and
Frary, 1971) derives from a more stringent definition of types than that
applied here. A type is there defined as a category of objects of such a
nature that every object in the category is more like every other object
in the category than it is like any object in any other category. The
method was designed to use “that particular set of indices of
association which produces the most reliable and valid solution.”
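
As a rough illustration of this stringent definition, the check below tests whether a proposed category satisfies it for a given matrix of agreement scores. The function names and the data are hypothetical, and the sketch assumes that "more like" means a strictly larger agreement score.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts from {(a, b): score}."""
    sim = {}
    for (a, b), score in pairs.items():
        sim.setdefault(a, {})[b] = score
        sim.setdefault(b, {})[a] = score
    return sim


def is_type(sim, members):
    """True if every member is more like every other member
    than it is like any object outside the category."""
    members = set(members)
    outsiders = set(sim) - members
    for a in members:
        inside = [sim[a][b] for b in members if b != a]
        outside = [sim[a][b] for b in outsiders]
        if inside and outside and min(inside) <= max(outside):
            return False
    return True


# Hypothetical agreement scores for five objects.
scores = symmetric({
    ("C", "F"): 34, ("C", "P"): 34, ("F", "P"): 30,
    ("C", "D"): 12, ("F", "D"): 10, ("P", "D"): 11,
    ("D", "L"): 31, ("C", "L"): 9, ("F", "L"): 8, ("P", "L"): 7,
})
print(is_type(scores, "CFP"), is_type(scores, "CF"), is_type(scores, "DL"))
```

Here the category CF fails because C is just as much like the outsider P as it is like F, whereas CFP and DL both satisfy the definition.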

Figure 4 portrays the classification into types by the above method 
for the data of this study. The method produced two classifications at 
Level 4 (Clusters CFPIQN and OGJT), two at Level 3, three at Level 
2 (CFPIQNAMKS, OGJTRBH, and DL), and two at Level 1. The 


clusters are arranged from left to right in the order of joining into a 
Figure 4. Reliable and Valid Hierarchical Classifications of the Data of Tables 1
and 2 (McQuitty and Frary, 1971). Vertical axis: Levels at Which
Classifications Occurred* (First through Fifth).

*Levels are labeled from top to bottom because the method partitions from top to bottom.


larger cluster. The order within an initial cluster is arbitrary; the order
within each of Clusters CFPIQN and AM, for example, is arbitrary.

Table 5 compares the classification results from each version with
those from Reliable and Valid Classification. In making these
comparisons the arbitrary orders within the initial clusters of the
Reliable and Valid method were adjusted to maximize their agreement
with each particular version with which they were compared. For example,
Figure 4 shows the arbitrary arrangement within an initial cluster to be
O, G, J, T, R. The fixed order of these objects in Figure 2, for the
Dispersed version, is OJTRG, though not in juxtaposition to one
another throughout; both O and G are separated from the other three
objects. When Reliable and Valid results were compared with those
from the Dispersed version at the bottom of Table 5, the above five
objects were reordered to read OJTRG, but changes from one initial
cluster to another, or reordering of clusters, were excluded.

The orders of objects having been established (as outlined above),
Kendall's tau (Kendall, 1955) was computed for the results from
Reliable and Valid with those of each of the three versions of the
current paper. The taus are listed in the right hand column of Table 5.
They show that the results from the Concentrated version are in highest
agreement with those from the Reliable and Valid method with a tau


TABLE 5
A Comparison of the Three Current Versions with Reliable and Valid

                                                          Tau
Reliable and Valid    CFPIQN  AM  KSE  JTGOR  BHDL
Concentrated          CFPIQN  KSAME  JTGOR  DLBH          .916

Reliable and Valid    CFPINQ  MAKSE  JTRGO  BHDL
Median                CFPIMA  KSNQE  JTRGDL  BOH          .842

Reliable and Valid    CFPINQ  MAKSE  OJTRG  BH  DL
Dispersed             (order partly illegible in source)  .800

For Reliable and Valid only, the order is arbitrary within each underlined group
(groups are separated by spaces here).


of .916, and they are followed in order by the Median and Dispersed
versions with taus of .842 and .800 respectively. A tau of .80 based on
20 cases is significant at a probability smaller than .00001 for a
one-tailed test (Kendall, 1955). These results indicate that all three
versions compare reasonably well with the Reliable and Valid method.
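
The tau computation can be sketched as follows. The function is a plain pair-counting version of Kendall's tau for two orderings without ties; the letter strings are the orderings as read from the scanned Table 5 and should be treated as a reconstruction rather than as verified data.

```python
from itertools import combinations


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same objects (no ties)."""
    position_b = {obj: k for k, obj in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):  # x precedes y in order_a
        if position_b[x] < position_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Orderings as read from Table 5 (reconstructed from the scan).
rv_vs_concentrated = "CFPIQNAMKSEJTGORBHDL"
concentrated = "CFPIQNKSAMEJTGORDLBH"
rv_vs_median = "CFPINQMAKSEJTRGOBHDL"
median = "CFPIMAKSNQEJTRGDLBOH"

tau_concentrated = kendall_tau(rv_vs_concentrated, concentrated)
tau_median = kendall_tau(rv_vs_median, median)
print(round(tau_concentrated, 3), round(tau_median, 3))  # 0.916 0.842
```

With 20 objects there are 190 pairs; 8 discordant pairs give 174/190 for the Concentrated version and 15 give 160/190 for the Median version, which agrees with the reported taus.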


A Criterion of Internal Consistency 


In developing a criterion of internal consistency, the original matrix 
is reordered to conform to the hierarchical clustering derived from 
Concentrated Clustering. The entries of the reordered matrix are not, 
however, the indices of the original matrix. They are instead the ranks 
of those entries within the columns of the original matrix. 

The reordered matrix of the current analysis is shown in Table 6.
The objects are ordered from left to right and from top to bottom in
the new matrix to conform to their order from left to right in the
hierarchical classification of Figure 1. As a consequence C and F are
listed first and second from left to right in Table 6 and are followed by
P. The entry in each of (1) Row C, Column F and (2) Row F, Column
C is one because 34 (the agreement score between C and F in Table 1)
is highest in each of Column C and Column F. The entry in Row P,
Column C and Row C, Column P of Table 6 is also 1 because their
agreement score, 34, is highest in each of their columns. Thirty-four
appears twice in Column C of Table 1 and is the highest entry in the
column. There is a tie. In the case of a tie, the rank value which would
be assigned if there were only one score at the tied value is assigned to
all tied scores.




TABLE 6 
A Portrayal of the Validity of the Classifications 


  C F P I Q N K S A M E J T G O R D L B H
C келет ОЗО 52111 9.9 4 81310 911
F 1 31231183 5 15 14 12 14 14 17 16 17 15
P Ys и 7 7 9 711 13 16 818
I 97202 84442 1 7111214 41117161715
Q 4 8 810 610104 7 5 69 31010 6941
N Е 884579 91110 81314 918
K 6 7309: 25713028 210 9 119 16 15 17 19 17 15 19 7
S 619099; SX 3910! 02 11 10 2 17 16 18 10 17 13 10 15 5
A Bro 3 0999 49:10 4 411 14 12 10 14 10 10 9 11
M 963466784 14 9 715 411 10 16 9 11
E 10 914 11 10 10 6 5 715 16 18 17 1917 6 716 1
J 14 15 13 13 10 14 17 18 12 12 17 [OF 1 26 7 478
T 13 12 11 12 13 12 14 16 12 10 18 1 222 T 2. 544»
G 121211 14 6 12 12 16 12 15 15 2 3 7.3. 124 222040
O 1041110 910 811 7 8 712 3 4 6 4 2.4 ЈА
R 16 16 17 16 18 16 18 19 18 17 18 4 2 816 r 550 8
D 16 14 16 15 13 16 12 12 12 131075675 1 4 3
L 18 17 18 17 17 18 14 13 17 18 12 11 12 314 5 1 25
B №17 1917 7815 1813 21315 5.5 3 27 2 2 3
H 19 19 19 19 19 19 14 13 19 19 717 19 19 17 16 10 10 9


The rank for the next score after the tied scores is the rank of the tied
scores plus the number of them. In this case, the rank of the tied scores
is one and the number of them is two, which yields a rank of three for the
next score, 33, in Column C with Row I of Tables 6 and 1. Ranks for
other tied scores were computed in this same fashion.
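
The tie rule just described amounts to giving each score a rank of one plus the number of strictly larger scores in its column. A minimal sketch follows; the function name is hypothetical, and the example column is illustrative apart from the two 34s followed by a 33.

```python
def ranks_within_column(scores):
    """Rank scores from highest (rank 1) downward.

    Tied scores share the rank the first of them would receive, and the
    next distinct score gets that rank plus the number of tied scores.
    """
    return [1 + sum(1 for other in scores if other > s) for s in scores]


# Two scores tied at 34, then 33, then a lower hypothetical score.
column = [34, 34, 33, 20]
ranks = ranks_within_column(column)
print(ranks)  # [1, 1, 3, 4]
```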

Any cluster can now be examined in terms of all of its entries to
assess how closely it conforms to a type as defined in Reliable and Valid
Hierarchical Classification (McQuitty and Frary, 1971). Some of the
more distinctive clusters of Figure 1 are emphasized in Table 6 by
enclosing them in heavy lines. These were chosen to correspond to the
major divisions in clusters as indicated by Figure 1.

If the definition of a type were fulfilled by a cluster, no rank in a
cluster would be larger than n − 1, where n is the number of objects in
the cluster; every object in the cluster would be required by the
definition of a type to be more like every other object in the cluster
than like any object in any other cluster. Any exception is called a
"spot" (McQuitty and Frary, 1971).

There are exceptions to the above requirement. Cluster CFPIQNKS,
for example, contains eight objects. If it conformed to a type
as herein defined, it would contain no rank above seven. It contains 14
ranks above seven out of a total of 56; that is, it contains 14
“spots.” However, the cluster does conform reasonably well to the
definition of types.
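
Counting spots for a cluster can be sketched as below. The helper names and the five-object score table are hypothetical; the sketch simply ranks each column as described earlier and counts the within-cluster ranks that exceed n − 1.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts from {(a, b): score}."""
    sim = {}
    for (a, b), score in pairs.items():
        sim.setdefault(a, {})[b] = score
        sim.setdefault(b, {})[a] = score
    return sim


def column_ranks(sim):
    """ranks[col][row]: rank of row's score within column col (1 = highest)."""
    return {col: {row: 1 + sum(1 for other in sim[col].values() if other > s)
                  for row, s in sim[col].items()}
            for col in sim}


def count_spots(ranks, cluster):
    """Number of within-cluster ranks larger than n - 1."""
    limit = len(cluster) - 1
    return sum(1 for col in cluster for row in cluster
               if row != col and ranks[col][row] > limit)


# Hypothetical agreement scores for five objects.
scores = symmetric({
    ("C", "F"): 34, ("C", "P"): 34, ("F", "P"): 30,
    ("C", "D"): 12, ("F", "D"): 10, ("P", "D"): 11,
    ("D", "L"): 31, ("C", "L"): 9, ("F", "L"): 8, ("P", "L"): 7,
})
ranks = column_ranks(scores)
print(count_spots(ranks, "CFP"), count_spots(ranks, "CFPD"))
```

A cluster of n objects has n(n − 1) within-cluster entries, which is the 56 quoted above for eight objects; in the toy data CFP has no spots, while adding D produces one.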

A difficulty with the outcome is that we do not know whether the
spots are due to fallibility of data or nonconformity of objects to the
above definition of types. The availability of the methods facilitates
investigations into issues of this kind.


The Isolation of Stringently Defined Types 


The methods of this paper were designed to isolate types even 
though they are vaguely delineated in data. The methods can also 
isolate types when they are clearly delineated in data, as will now be 
shown. 

Let a type be defined as a category of objects which possesses a 
unique pattern of characteristics; every object of the type possesses all 
of the characteristics, and no object not in the type possesses all of the 
characteristics; each of these latter objects possesses instead the pat- 
tern of characteristics unique to its type. This definition of types is fur- 
ther restricted to recognize that types higher in a hierarchy (with more 
members) have fewer characteristics in common. 

Let the content on which the objects of an analysis are assessed be so 
chosen that no matrix or submatrix will yield a reciprocal score for 
other than members of a type, i.e., wherever this is possible. The only 
matrix or submatrix in which it is not possible, theoretically, is one 
which does not contain at least two members of a type. 

Let the content on which the objects are assessed be further
restricted. Let it be so specific to the different types that a reciprocal
score between nonmembers will be radically lower than all other
reciprocal scores; it can then be spotted and ignored.

Under the above conditions, the method will yield clusters in which
every member of a cluster is associated with all other members of the
cluster through reciprocal pairs. Let i, j, and k be any three objects of
a higher order type and let them be associated initially by reciprocal
pairs ij and jk. When j is classified with i, it must take with it k to form
the higher order type ijk (see early section on Initial Definitions of
Types), or analogously when j is classified with k, it must include i. At
a more complex level, j must take with it into the higher level type all
of the objects with which it had been classified up to that time. By this
process all objects are classified into a higher order structure.


A General Evaluation of the Method 


Unique values of the method are its ability to analyze rapidly huge
matrices of interassociations between objects into statistically defined
types and to do this whether or not the types are precisely or loosely
delineated by the data.


The problem, as usual in multivariate analyses of this kind, is to
select the proper sample of content on which the objects are assessed
so that substantive types can be isolated if they do in fact exist. Trial
and error applications of the method to carefully selected content in
terms of theoretical departures should facilitate the eventual isolation
of types if they do in fact exist.


Summary 


This paper develops and illustrates a rapid method for clustering
large matrices of interassociations between objects, up to 1000 × 1000
and larger, into hierarchical structures; it should facilitate the
isolation of substantive types if they do in fact exist.
ANALYSIS TECHNIQUES FOR EXPLORATORY USE
OF THE MULTITRAIT-MULTIMETHOD MATRIX


MICHAEL L. RAY 
Stanford University 


ROGER M. HEELER 
York University 


Methods for analyzing the multitrait-multimethod matrix are 
reviewed and the results of their application to a classic data set are 
compared. It is shown that different analysis methods can yield 
different validity conclusions, and that the results obtained are partly 
dependent on the subjective judgments of the users. It is proposed 
that several analysis methods should be used in tandem on each data 
set and their results should be examined for convergence. Multitrait- 
multimethod matrices should be examined in an exploratory as well 
as validity testing mode so as not to waste their rich data content. 


THE multitrait-multimethod matrix of Campbell and Fiske (1959)
has become a standard device for testing the validity of the measures
used in educational and psychological research. Like all good
innovations, the multitrait-multimethod matrix has spawned a series of
criticisms and developments of the original paradigm. One important
line of evolution has been the use of the matrix for exploratory
research. A series of techniques that aid exploratory use of the matrix
have been developed and are reported in this paper, together with a
suggestion that the use of several of these techniques in parallel may
increase the overall validity of the analysis.

The classic use of the multitrait-multimethod matrix is as a
confirmatory technique. The intercorrelations of the several trait-method
units are used to test for both the convergent and discriminant validity
of the predefined traits, after first establishing that adequate
reliability is present. Exploratory analysis of the matrix also tests for
the convergent and discriminant validity of traits, but the traits tested
may be revised from those prehypothesized according to the trait-method
relationship revealed in the matrix.

The idea of using the rich data content of the multitrait-multi- 
method matrix in an exploratory mode is not new. Campbell and 
Fiske (1959, p. 103) wrote, “We believe that a careful examination of a
multitrait-multimethod matrix will indicate to the experimenter what 
his next step should be; it will indicate which methods should be dis- 
carded or replaced, which concepts need sharper delineation, and 
which concepts are poorly measured because of excessive or confound- 
ing method variance." More recently Boruch and Wolins (1970), Con- 
ger (1971), and Krause (1972) have suggested that the multitrait- 
multimethod matrix be used in trait development. For example, 
Krause (1972) suggested that failing a convergent-discriminant
validity test is not, in general, disconfirmative of an instrument's 
validity. Usually failure of a particular criterion is a clue to the 
strategy of instrument development that is required. Thus no single 
study is likely to constitute adequate validation of a measure, but each 
succeeding study should influence both the structure and inference to 
be derived from the next. 

The exploratory approach may be of particular value in association 
with the extended areas of application to which the original matrix has 
been applied. The most complete set of suggestions for extended use 
has been given by Paisley, Collins, and Paisley (1971). They 
demonstrated how the convergent-discriminant process may be used 

to validate five elements of research design through the ten matrices 
formed by all possible pairs of the elements, namely concepts (traits), 
measures (methods), populations, times, and analysis models. Other 
authors have made use of one or more of these matrices to meet par- 
ticular research needs. For example Centra (1971) used a multigroup- 
multiscale matrix in which different populations replaced the methods 
of the usual matrix. Werts, Jöreskog, and Linn (1972) used a traits-
time matrix formulation as an approach to the analysis of panel data.
Fishbein (1967) proposed a multiattitude object-multimethod matrix
to test the ability of behavioral measures to discriminate between at- 
titudinal objects or situations. In these varied applications the ex- 
ploratory development of matrix elements is likely to be at least as im- 
portant as element confirmation. 

Several authors, including Campbell and Fiske (1959), Althauser, 
Heberlein, and Scott (1971), and Krause (1972) have shown that dis- 
parate measurement methods are desirable for a multitrait- 
multimethod matrix. The use of disparate methods has two desirable
properties. First, it increases the likelihood that the methods are
uncorrelated. This is essential for the validity of the traditional analysis
and is desirable for more recent methods. Second, disparate methods 
are required if an adequate sampling of the universe of methods is to 
be obtained. Krause (1972) has noted that an adequate sampling of 
traits, methods, and populations is required if the results of a multi- 
trait-multimethod matrix are to be generalizable. 

Once again, however, these developments increase the need for ex- 
ploratory as opposed to confirmatory matrix analysis. It is unlikely 
that new and disparate methods will produce confirmation without a 
series of exploratory analyses first being conducted. 


Analysis Techniques 


Given a need for an exploratory approach, what analysis techniques 
are available, and what are their strengths and weaknesses? The 
original Campbell and Fiske analysis approach was a simple examina- 
tion of the relative sizes of the matrix coefficients. Convergent valida- 
tion was accomplished by examining the size of the correlations 
(validities) between different measures of the same trait. A three stage 
discriminant validation then followed in which the validities were 
compared with the different-trait, different-method correlations, and 
the overall pattern of correlations was checked for consistency. This 
analysis approach provides for a strict confirmatory approach, 
although validation failures can be used to suggest further measure 
developments. Althauser, Heberlein, and Scott (1970) have noted 
some causal path inconsistencies in the Campbell and Fiske analysis. 
These can be minimized if the methods are orthogonal, the matrix is of
greater order than two traits × two methods, and if testing effects
between measures can be avoided.
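
The size comparisons in the Campbell and Fiske approach can be sketched as below. This is an illustration only: the function name and the tiny two-trait, two-method correlation table are hypothetical, and the comparison against same-method, different-trait correlations is included on the assumption that it belongs to the usual Campbell and Fiske criteria, although it is not spelled out in the paragraph above.

```python
from itertools import combinations


def campbell_fiske_checks(r, traits, methods):
    """Compare each validity with the relevant heterotrait correlations."""
    def corr(a, b):
        return r[(a, b)] if (a, b) in r else r[(b, a)]

    results = {}
    for t in traits:
        others = [u for u in traits if u != t]
        for m1, m2 in combinations(methods, 2):
            validity = corr((t, m1), (t, m2))
            # Different trait, different method (same row or column).
            hetero_hetero = ([corr((t, m1), (u, m2)) for u in others] +
                             [corr((u, m1), (t, m2)) for u in others])
            # Different trait, same method (assumed second criterion).
            hetero_mono = [corr((t, m), (u, m))
                           for m in (m1, m2) for u in others]
            results[(t, m1, m2)] = {
                "validity": validity,
                "beats_hetero_hetero": all(validity > x for x in hetero_hetero),
                "beats_hetero_mono": all(validity > x for x in hetero_mono),
            }
    return results


# Hypothetical correlations between (trait, method) measures.
r = {
    (("A", "staff"), ("A", "self")): 0.60,
    (("B", "staff"), ("B", "self")): 0.30,
    (("A", "staff"), ("B", "staff")): 0.40,
    (("A", "self"), ("B", "self")): 0.35,
    (("A", "staff"), ("B", "self")): 0.20,
    (("B", "staff"), ("A", "self")): 0.25,
}
checks = campbell_fiske_checks(r, ["A", "B"], ["staff", "self"])
for key, outcome in checks.items():
    print(key, outcome)
```

In this made-up example trait A passes both comparisons, while trait B exceeds the different-method correlations but not the same-method ones, the kind of validation failure that can be used to suggest further measure development.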

A plethora of alternate analysis techniques, mostly mutants of fac- 
tor analysis, have followed the original Campbell and Fiske (1959) ap- 
proach. Space limitations prevent review of all these alternatives, but 
three recent developments will be contrasted: (1) restricted maximum 
likelihood factor analysis (RMLFA), (2) clustering-nonmetric scaling, 
and (3) multimethod factor analysis. The first two are described below, 
and multimethod factor analysis will be covered briefly later in
connection with an actual analysis.

The RMLFA technique is a superior factor analytic technique
developed by Jöreskog (1969). It allows for the testing of specified
models of the underlying trait-method factors, thus making clear what
is being tested. Different models may be tested, varying in complexity
from, for example, an orthogonal trait-factors-only model to a
traits-and-methods factors model with correlated factors. The goodness of
fit of each model can be tested by a χ² statistic, and measure variance
can be broken down into trait, method, and error components. The
technique can be used in either confirmatory or exploratory mode, i.e.,
it can be used to test the prehypothesized trait-method model, or to
examine alternative models.

The technique assumes an underlying linear factor structure to the
matrix, and the large sample χ² statistic used in testing model
goodness of fit requires the assumption that the variables have a
multivariate normal distribution. Correlation coefficients are appropriate
matrix entries given the linear formulation and metric variable
structure assumed.

The cluster-nonmetric scaling technique (Ray, 1973; Shepard,
1972) has many characteristics opposite to those of RMLFA. The technique
uses cluster analysis to group similar measures, and portrays the 
clusters in a nonmetric scaling configuration to aid eye interpretation 
of the groupings. The cluster-nonmetric scaling technique does not 
permit significance testing of prespecified models, but the measure 
groupings indicated by the data are clearly shown. In contrast with 
RMLFA no linearity assumptions are required, and measures of as- 
sociation other than the correlation coefficient can be used. 

These divergent characteristics of RMLFA and cluster-nonmetric 
scaling are advantageous if both techniques are used in parallel on the 
same data set. It will be shown later that even RMLFA, which appears 
to leave little room for analyst bias, yields different results for different
analysts when applied to a common data set. Even more inconsisten- 
cies are to be expected when different analysts each use different 
analysis methods. 

The inconsistencies can be turned to advantage if divergent analysis 
methods are compared in a convergent validation of analysis methods. 
Results common to two or more disparate techniques will be more 
convincing than those which appear in one technique only. 


An Illustration 


A good matrix for comparison of alternative multitrait-
multimethod matrix analyses is given by Campbell and Fiske (1959, p.
96, Table 12). It contains the intercorrelations for five traits,
"assertive," "cheerful," "serious," "unshakeable poise," and "broad
interests," measured by the three methods of "staff ratings," "teammate
ratings," and "self-ratings," for a group of clinical psychologists.

The original analysis of this clinical psychologist matrix by
Campbell and Fiske (1959) found the trait "assertive" to have both
convergent and discriminant validity. The three traits "cheerful,"
"serious," and "broad interests" achieved convergent validity and
substantial discriminant validity. The trait "unshakeable poise"
achieved neither convergent nor discriminant support.

This matrix was subsequently analyzed by the third analysis
development mentioned above: Jackson's (1969) multimethod factor
analysis.

For this approach Jackson converts the monomethod triangles
(correlations between traits as measured by the same method) into identity
matrices and factor analyzes the resultant modified multitrait-
multimethod matrix. When multimethod factor analysis was applied
to the clinical psychologist matrix, all five traits were found valid. It
should be noted, however, that multimethod factor analysis is prone to
suggesting strong evidence of validity even where other analysis
methods indicate some traits with weak validity. Further, Jackson's
approach is limited by the initial conversion of the monomethod
triangles. In the current example, this results in 30% of the matrix
entries being discarded, which is a substantial information loss.
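
The conversion and the resulting information loss can be illustrated as follows. The function names are hypothetical, the sketch assumes the measures are ordered method by method, and the share is counted over the distinct off-diagonal correlations, which is one plausible reading of the figure quoted above.

```python
def jackson_modify(corr, n_traits, n_methods):
    """Replace each monomethod block of a square matrix by an identity block.

    Assumes rows and columns are ordered method by method.
    """
    size = n_traits * n_methods
    out = [row[:] for row in corr]
    for i in range(size):
        for j in range(size):
            if i // n_traits == j // n_traits:  # same method
                out[i][j] = 1.0 if i == j else 0.0
    return out


def monomethod_share(n_traits, n_methods):
    """Count the distinct correlations lost to the conversion."""
    n_measures = n_traits * n_methods
    total = n_measures * (n_measures - 1) // 2
    monomethod = n_methods * (n_traits * (n_traits - 1) // 2)
    return monomethod, total, monomethod / total


# Five traits by three methods, as in the clinical psychologist matrix.
lost, total, share = monomethod_share(5, 3)
print(lost, total, round(share, 2))

# Hypothetical 2-trait x 2-method matrix to show the conversion itself.
small = [[1.0, 0.4, 0.6, 0.2],
         [0.4, 1.0, 0.3, 0.5],
         [0.6, 0.3, 1.0, 0.35],
         [0.2, 0.5, 0.35, 1.0]]
modified = jackson_modify(small, n_traits=2, n_methods=2)
print(modified)
```

Under this counting, 30 of the 105 distinct correlations fall in the monomethod triangles, or roughly 29 percent.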

The RMLFA technique has been applied to the clinical psychologist
data by two sets of authors, Boruch and Wolins (1970) and Jöreskog
(1971).

Jöreskog (1971) first fitted a model consisting of five trait factors
only. The χ² goodness of fit statistic yielded by the technique indicated
that this model was not a good fit to the data, i.e., the χ² of 140.46 was
too large for the available 80 degrees of freedom of the model.
Jöreskog's next model added three method factors to the five trait
factors and yielded an acceptable fit. However, the factor
intercorrelations given by the technique indicated that the "staff" and
"self" method factors were unit correlated, so these two factors were
combined into one with an acceptable χ² of 61.51 with 64 degrees of
freedom. This final solution contained five trait and two method
factors. The factor loadings of each measure on each trait were given,
together with the factor intercorrelations and the trait-method-error
variance components of each measure. The factor intercorrelations
showed that the trait "cheerful" had fairly high correlations with
"assertive" and "unshakeable poise," so Jöreskog observed that it was
probably confused with these traits. He might also have noted the
fairly high negative correlation between "cheerful" and "serious." All
methods yielded good measures of "assertive." The method "staff
ratings" was best for this trait with a trait variance of .76, method
variance of .01, and an error variance of .39. "Cheerfulness" and
"unshakeable poise" were also best measured by "staff ratings," but
"serious" was best measured by "teammate ratings," and "broad
interests" was best measured by "self-ratings."
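
The two fit statistics quoted for Jöreskog's models can be put on a common footing by converting them to approximate tail probabilities. The sketch below uses the Wilson-Hilferty normal approximation to the chi-square distribution, chosen here only because it needs nothing beyond the standard library; it is not the computation used in the original analyses, and the function name is hypothetical.

```python
import math


def chi2_upper_tail(x, df):
    """Approximate P(chi-square with df degrees of freedom > x).

    Uses the Wilson-Hilferty cube-root normal approximation.
    """
    z = ((x / df) ** (1.0 / 3.0) - (1.0 - 2.0 / (9.0 * df))) / math.sqrt(2.0 / (9.0 * df))
    return 0.5 * math.erfc(z / math.sqrt(2.0))


p_traits_only = chi2_upper_tail(140.46, 80)   # five trait factors only
p_final_model = chi2_upper_tail(61.51, 64)    # five trait and two method factors
print(p_traits_only, p_final_model)
```

The first probability is far below conventional significance levels, so the traits-only model is rejected, whereas the second is well above them, which is consistent with the description of the final model as an acceptable fit.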
Boruch and Wolins (1970) used an initial 10 factor model consisting
of five traits, three methods, and one general factor. This was reduced
to an eight factor model by combining the traits "assertive" and
"cheerful" and the methods "staff" and "self." "Cheerful" was judged
not to be a distinctive trait because it loaded heavily on the general
factor and had low loadings on the combined "assertive-cheerful" factor.
The other four traits were found to be valid because of their low
loadings on the general factor and high loadings on specific factors.
"Unshakeable poise" was well measured only by "staff ratings."


It is interesting to compare the analytic solutions obtained by
Jöreskog and by Boruch and Wolins. Both used the same general analysis
technique and an exploratory mode, but obtained different, well fitting
solutions. Boruch and Wolins included a general factor; Jöreskog did
not. Both combined the methods "staff ratings" and "self ratings"; but
while Jöreskog maintained "cheerful" as a separate trait, albeit linked
with "assertive" and "unshakeable poise," Boruch and Wolins found
no valid measure of "cheerful" and combined it with "assertive."
These differences illustrate that RMLFA, despite its test statistic, is a
technique that in part is dependent on the subjective judgments of its
users.

It is also interesting to compare the RMLFA solutions with the
Campbell and Fiske (1959) and Jackson (1969) solutions. Campbell
and Fiske found the trait "unshakeable poise" invalid but the other
four traits to have satisfactory validity. This is in reasonable agreement
with Jöreskog's (1971) solution, because whilst Jöreskog found
"unshakeable poise" valid, he observed that it could only be
satisfactorily measured by "staff ratings." Campbell and Fiske noted an
agreement between "staff ratings" and "teammate ratings," in
contrast to the grouping of "staff ratings" and "self ratings" found by the
RMLFA analyses. Boruch and Wolins' (1970) analysis also differs
from Campbell and Fiske's in the treatment of traits. "Cheerful" was
valid for Campbell and Fiske but not for Boruch and Wolins, whilst
the reverse was true for "unshakeable poise." The Jackson (1969)
results differed from all other analyses in finding substantial validity
for all traits. These differences illustrate that the procedures for
analyzing multitrait-multimethod matrices do not necessarily yield
identical results.

A fifth analysis of the clinical psychologist data was done for this 
paper. This time the clustering-nonmetric scaling approach was used, 
and the results parallel those of Campbell and Fiske. 

Table 1 shows the clusters obtained with Johnson's (1967)
hierarchical clustering. The clusters are shown both in terms of order
of formation and in Gruvaeus and Wainer's (1972) uniquely ordered
format. Figure 1 shows the same cluster analysis, at three levels of
clustering, within a two dimensional Euclidean measure space
obtained from a nonmetric scaling (Kruskal, 1964) of the data. This form
of display aids perception of the clustering (Shepard, 1972).
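
The clustering step can be illustrated with a small agglomerative routine. This is a sketch under stated assumptions, not the HICLUS program: it uses the "diameter" (complete-linkage) rule, in which the similarity of two clusters is the smallest correlation between their members, and the text does not say which of Johnson's rules was applied. The four measures and their correlations are hypothetical.

```python
def symmetric(pairs):
    """Build a symmetric dict-of-dicts from {(a, b): value}."""
    sim = {}
    for (a, b), value in pairs.items():
        sim.setdefault(a, {})[b] = value
        sim.setdefault(b, {})[a] = value
    return sim


def hierarchical_clusters(sim):
    """Merge the two most similar clusters until one remains.

    Cluster similarity is the minimum member-to-member correlation
    (assumed diameter rule). Returns (level, members) in merge order.
    """
    clusters = [frozenset([label]) for label in sorted(sim)]
    merges = []
    while len(clusters) > 1:
        candidates = [(min(sim[x][y] for x in a for y in b), i, j)
                      for i, a in enumerate(clusters)
                      for j, b in enumerate(clusters) if i < j]
        level, i, j = max(candidates)
        merged = clusters[i] | clusters[j]
        merges.append((level, sorted(merged)))
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Hypothetical correlations: traits A and C, methods St (staff) and T (teammate).
corr = symmetric({
    ("A-St", "A-T"): 0.71, ("C-St", "C-T"): 0.53,
    ("A-St", "C-St"): 0.37, ("A-T", "C-T"): 0.30,
    ("A-St", "C-T"): 0.25, ("A-T", "C-St"): 0.20,
})
merges = hierarchical_clusters(corr)
for level, members in merges:
    print(level, members)
```

In this toy example the clusters form around traits before methods, which is the pattern the article reads as evidence of trait validity.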

The results indicate that this is a matrix containing considerable
trait validity, since the clusters mostly form around traits rather than
around methods. "Assertive" appears to be the strongest trait, while
"cheerful," "broad interests," and "serious" also form distinctive
clusters, as shown in Figure 1. "Unshakeable poise" never forms a
distinctive cluster, but appears to be associated with "cheerful." The
methods "staff rating" and "teammate rating" appear to be closely
associated. They cluster together first in each of the four traits which
form distinctive clusters. The visual display from the nonmetric scaling
further emphasizes the pure cluster analysis findings. Those trait
measures which are not in individual clusters with other measures of
the same trait are close to those other same trait measures. The
"cheerful" and "unshakeable poise" measures are located in the same
spatial zone.

Figure 1. Two-dimensional MDSCAL configuration of the clinical assessment data,
showing three levels of HICLUS clustering. The "stress" (Kruskal, 1964) of this figure
was .20. The analysis was checked with a three-dimensional figure having a stress of .11.
Legend: traits are Assertive, Cheerful, Serious, Unshakeable Poise, and Broad
Interests; methods are St = Staff Ratings, T = Teammate Ratings, Se = Self Ratings.

Like Campbell and Fiske (1959), the cluster results indicate
"assertive" to be the strongest trait, "cheerful," "serious," and "broad
interests" to have good validity, and "unshakeable poise" to have
minimal validity. Like Campbell and Fiske, the cluster analysis also
found an association between "staff ratings" and "teammate ratings."

The cluster results differed from both RMLFA analyses in this
association of methods. Both RMLFA analyses associated "staff
ratings" and "self ratings." The cluster trait results were similar to
Jöreskog's results, but differed from Boruch and Wolins' (1970)
results. "Cheerful" was valid for the cluster analysis but not for Boruch
and Wolins, whilst the reverse was true for "unshakeable poise."

If the four techniques and five analyses are considered together it
can be seen that measures for the traits "assertive," "serious," and
"broad interests" are supported by all. Measures for "cheerful" are
supported by Campbell and Fiske (1959), Jöreskog (1971), and cluster
analysis, but not by Boruch and Wolins (1970). Thus the
preponderance of evidence seems to support "cheerful."
"Unshakeable poise" was supported only by the RMLFA and
multimethod factor analysis results. Jöreskog found only the "staff
ratings" method of measuring "unshakeable poise" satisfactory. A
trait with only one measure may be distinct, but its validity is hardly
proven in a convergent sense. Thus "unshakeable poise" appears to be
doubtfully measured. There is evidence in both the cluster and
Jöreskog analyses that "unshakeable poise" is associated with other
traits.

In all five analyses described above, correlation coefficients were 
used as the information statistics in the matrix. Krause (1972) notes 
that measurement sets can be codimensional but have low correlation. 
Thus correlation coefficients under-represent the codimensionality 
present, thereby decreasing the frequency of convergent validation and 
increasing the frequency of discriminant validation. Krause (1972) 
suggests that other measures, such as coefficients of ordinal
consistency, should be used if possible.


It is a strength of the cluster-nonmetric scaling analysis that varied
information coefficients can be used without contravening the
technique's data requirements. In general the corroborative power ob-
tained by "convergence of analysis methods" will be increased if dif- 
ferent information coefficients are used in addition to different analysis 
methods. The cluster-nonmetric scaling analysis method provides this 
opportunity. Unfortunately no other information coefficients were 
available for the clinical psychologists matrix. However, Goodman 
and Kruskal's Gamma coefficient has been used by Ray (1973) in a 
multiple measure study of political attitudes. The political study used 
five measures of three traits measured across precinct groups. A dismal 
picture of weak convergent validation and non-existent discriminant 
validation was found for the traits in their original formulation, but 
the measure clusters suggested several plausible new trait formulations 
that could be of value in further research. 


Conclusion 


Only rarely will a single multitrait-multimethod matrix study suffice 
for validation. More generally a series of studies will hone the limits of 
generalizability of a trait-method combination. In these circumstances
an exploratory analysis approach is desirable so that information use 
from each succeeding study is maximal. Several exploratory analysis 
techniques have been described. If several of these techniques, using 
varied information coefficients, are used in parallel, a valuable re- 
straint is provided on the judgmental bias effects to which each indi- 
vidual technique is subject. This does not replace testing with new data 
sets and the Monte Carlo evaluation of the several techniques, but it 
does increase the likelihood of valid analysis.


BEHAVIOR OF THE PRODUCT-MOMENT
CORRELATION COEFFICIENT WHEN TWO
HETEROGENEOUS SUBGROUPS ARE POOLED¹


ALAN L. SOCKLOFF 
Temple University 


An equation was derived to determine the relationship between the
pooled within-subgroup correlation coefficient, r_w, and the
correlation coefficient obtained from the total group data, r_t. It
was, thus, possible to assess the amount of distortion introduced by
pooling heterogeneous subgroups. As a basis for deciding whether to
pool two subgroups in order to calculate r_t, a two-stage procedure was
recommended: (1) comparison of the two within-subgroup r's; and
(2) comparison of r_t and r_w. On the basis of results for the second
stage test, distortion in r_t was shown to be a function of the pattern
of subgroup mean differences, total group sample size, and the
magnitude of r_w. Implications were discussed.


FREQUENTLY, in the psychological and educational literature,
correlational studies are reported in which product-moment correlation
coefficients are calculated between two variables for sets of data
pooled across two or more, possibly heterogeneous, subgroups. The
effect of pooling heterogeneous subgroups to calculate a
product-moment correlation coefficient was first noted by Pearson (1896, p.
283) and later discussed by Pearson, Lee, and Bramley-Moore (1899,
pp. 274-278). By way of illustration, Pearson et al. presented
correlational data between length and breadth of skulls for 805 males
(r = .0869) and 340 females (r = -.0424). When the two subgroups were
pooled, an r of .1968 was obtained, and this r was considered to
represent a large spurious value. These authors concluded: “The
correlation may properly be called spurious, yet as it is almost
impossible to guarantee the absolute homogeneity of any community, our
results for correlation are always liable to an error, the amount of
which cannot be foretold (p. 278).”

¹ Portions of this study were presented at the American Educational
Research Association convention, Chicago, 1974.

Awareness of the problem was carried through the various revisions
of Yule's text as a short section in a chapter covering miscellaneous
theorems related to correlations. In the 11th edition, Yule and Kendall
(1937) offered the following theorem: “If X and Y are uncorrelated in
each of two records, they will nevertheless exhibit some correlation
when the two records are mingled, unless the mean value of X in the
second record is identical with that in the first record, or the mean
value of Y in the second record is identical with that in the first
record, or both (p. 301).”

From a sampling of recent introductory statistics textbooks in psy- 
chology and education, it was found that writers do discuss the effect 
on the correlation coefficient resulting from pooling heterogeneous 
subgroups (Games and Klare, 1967; Glass and Stanley, 1970; 
Guilford, 1965; Walker and Lev, 1969; among others). Where 
references are made, Dunlap's (1937) paper on the combinative 
properties of correlation coefficients and Lindquist's (1940, pp. 219- 
228) text are most frequently cited. Dunlap, admitting to summarizing
a “well-known” method, derived an equation for calculating a total
group correlation coefficient from within-subgroup correlation
coefficients, means, and standard deviations. Lindquist presented a
practical solution to the problem of pooling heterogeneous subgroups by
distinguishing the total group correlation coefficient from the pooled
within-subgroup correlation coefficient, noting that the latter may be
interpreted as an average of the subgroup correlation coefficients.
Under the assumption of homogeneous correlation across subgroups,
Lindquist argued that the pooled within-subgroup correlation
coefficient is the preferred measure insofar as it is more stable
across randomly drawn subgroups.

To date, a mathematical formulation of the effect on the correlation 
coefficient from pooling heterogeneous subgroups has been lacking. In 
lieu of such a formulation, textbook writers have tended to stress 
cautious interpretation and the use of subgroup correlation coeffi- 
cients to help provide a rational explanation for correlational results in 
the total group data. The major interest of this paper is the derivation 
of a mathematical formulation and the demonstration of the effects of 
pooling two subgroups on the total group correlation coefficient. Of 
additional interest is a procedure to guide the decision concerning the 
pooling of data for the purpose of calculating a single correlation 
coefficient.


Formulation


Given two subgroups, let n_1 and n_2 be the subgroup sample sizes,
where n_1 + n_2 = N. Let U and V be the distances between the subgroup
means for X and Y, respectively, i.e., U = X̄_2 - X̄_1 and V = Ȳ_2 - Ȳ_1.


Sum of Squares and Cross Products 
The sum of cross products for the total group, SS(XY)_t, is defined

$$SS(XY)_t = SS(XY)_1 + SS(XY)_2 + UV\left(\frac{n_1 n_2}{N}\right), \tag{1}$$

where SS(XY)_1 and SS(XY)_2 are the within-subgroup sums of cross
products. Since the sum of squares for X in the total group is actually
the sum of cross products with respect to itself,

$$SS(X)_t = SS(X)_1 + SS(X)_2 + U^2\left(\frac{n_1 n_2}{N}\right). \tag{2}$$

Similarly, for Y,

$$SS(Y)_t = SS(Y)_1 + SS(Y)_2 + V^2\left(\frac{n_1 n_2}{N}\right). \tag{3}$$


Total Group r_t


Using Equations 1, 2, and 3, the correlation coefficient for the total
group is:

$$r_t = \frac{SS(XY)_1 + SS(XY)_2 + UV(n_1 n_2/N)}{\sqrt{\left[SS(X)_1 + SS(X)_2 + U^2(n_1 n_2/N)\right]\left[SS(Y)_1 + SS(Y)_2 + V^2(n_1 n_2/N)\right]}}$$

An equivalent form of this equation was derived by Dunlap (1937). If
SS(XY)_w = SS(XY)_1 + SS(XY)_2, SS(X)_w = SS(X)_1 + SS(X)_2, and
SS(Y)_w = SS(Y)_1 + SS(Y)_2, then the equation defining r_t may be
simplified:

$$r_t = \frac{SS(XY)_w + UV(n_1 n_2/N)}{\sqrt{\left[SS(X)_w + U^2(n_1 n_2/N)\right]\left[SS(Y)_w + V^2(n_1 n_2/N)\right]}} \tag{4}$$

The correlation coefficient for the total group is, therefore, expressed
in terms of pooled within-subgroup sums of squares and cross
products, subgroup sample sizes, and distances between the subgroup
means.


Pooled Within-Subgroup r_w


The pooled within-subgroup correlation coefficient is obtained from
pooling of the subgroup sums of squares and cross products:

$$r_w = \frac{SS(XY)_1 + SS(XY)_2}{\sqrt{\left[SS(X)_1 + SS(X)_2\right]\left[SS(Y)_1 + SS(Y)_2\right]}} = \frac{SS(XY)_w}{\sqrt{SS(X)_w \, SS(Y)_w}} \tag{5}$$
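Equations 4 and 5 can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using simulated data; the function name and the simulated subgroup parameters are illustrative choices, not part of the original study. It computes r_t from the pooled within-subgroup sums plus the mean-difference terms, and compares it with the correlation computed directly on the pooled observations.

```python
import numpy as np

def r_total_and_within(x1, y1, x2, y2):
    """Return (r_t, r_w) using Equations 4 and 5 for two subgroups."""
    x1, y1, x2, y2 = (np.asarray(a, dtype=float) for a in (x1, y1, x2, y2))
    n1, n2 = len(x1), len(x2)
    N = n1 + n2
    # Pooled within-subgroup sums of squares and cross products
    ss_xy_w = np.sum((x1 - x1.mean()) * (y1 - y1.mean())) \
            + np.sum((x2 - x2.mean()) * (y2 - y2.mean()))
    ss_x_w = np.sum((x1 - x1.mean()) ** 2) + np.sum((x2 - x2.mean()) ** 2)
    ss_y_w = np.sum((y1 - y1.mean()) ** 2) + np.sum((y2 - y2.mean()) ** 2)
    # Distances between subgroup means and the n1*n2/N weight
    U = x2.mean() - x1.mean()
    V = y2.mean() - y1.mean()
    k = n1 * n2 / N
    r_t = (ss_xy_w + U * V * k) / np.sqrt((ss_x_w + U**2 * k) * (ss_y_w + V**2 * k))
    r_w = ss_xy_w / np.sqrt(ss_x_w * ss_y_w)
    return r_t, r_w

# Simulated subgroups with different means on both variables (illustrative only)
rng = np.random.default_rng(0)
x1 = rng.normal(0.0, 1.0, 40)
y1 = 0.5 * x1 + rng.normal(0.0, 1.0, 40)
x2 = rng.normal(2.0, 1.0, 60)
y2 = 0.5 * x2 + rng.normal(1.0, 1.0, 60)

r_t, r_w = r_total_and_within(x1, y1, x2, y2)
r_direct = np.corrcoef(np.concatenate([x1, x2]), np.concatenate([y1, y2]))[0, 1]
print(r_t, r_direct, r_w)
```

The first two printed values should agree to floating-point precision, since Equation 4 is an exact algebraic identity; r_w will generally differ from both when the subgroup means differ.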


It should be clear from inspection of Equation 5 that for equal
variances of X in both subgroups and equal variances of Y in both
subgroups, r_w is a weighted arithmetic mean of the within-subgroup
correlation coefficients, weighted by the number of observations in each
subgroup.

Furthermore, r_w may be compared to r_t in two ways. First, r_w is a
special case of r_t resulting when subgroup mean differences are
nonexistent (i.e., U = V = 0). Second, r_w is that special case of r_t
when subgroup differences are eliminated statistically. The latter
comparison requires the form of a first-order partial correlation
r_{X_t Y_t . Z}, where X_t and Y_t are the two variables measured in the
total group and Z is a dichotomous variable indicating subgroup
membership. In the formula for the first-order partial correlation
coefficient, since r_{X_t Z} and r_{Y_t Z} are point-biserial
correlation coefficients, the result of operating upon this formula is:

$$r_{X_t Y_t \cdot Z} = \frac{SS(XY)_t - UV(n_1 n_2/N)}{\sqrt{\left[SS(X)_t - U^2(n_1 n_2/N)\right]\left[SS(Y)_t - V^2(n_1 n_2/N)\right]}}$$

Since the above equation represents an alternative definition of r_w,
then r_w, the pooled within-subgroup correlation coefficient, is also the
result of statistically eliminating subgroup differences from the total
group correlation coefficient. The latter definition of r_w suggests that
r_w can be meaningfully used as a descriptive statistic with a known
sampling distribution.
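The equivalence between the pooled within-subgroup coefficient and the first-order partial correlation can be illustrated with a short numerical sketch. It assumes NumPy and simulated data; the helper names `pooled_within_r` and `partial_r` are hypothetical and introduced only for this illustration.

```python
import numpy as np

def pooled_within_r(x, y, g):
    """Equation 5: pool within-subgroup sums of squares and cross products."""
    ss_xy = ss_x = ss_y = 0.0
    for label in np.unique(g):
        dx = x[g == label] - x[g == label].mean()
        dy = y[g == label] - y[g == label].mean()
        ss_xy += np.sum(dx * dy)
        ss_x += np.sum(dx ** 2)
        ss_y += np.sum(dy ** 2)
    return ss_xy / np.sqrt(ss_x * ss_y)

def partial_r(x, y, z):
    """Standard first-order partial correlation of x and y, holding z constant."""
    r = np.corrcoef([x, y, z])
    r_xy, r_xz, r_yz = r[0, 1], r[0, 2], r[1, 2]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Simulated total group; g is the dichotomous subgroup indicator Z (illustrative only)
rng = np.random.default_rng(1)
g = np.repeat([0, 1], [30, 50])
x = rng.normal(size=80) + 1.5 * g
y = 0.6 * x + rng.normal(size=80) - 2.0 * g

r_w = pooled_within_r(x, y, g)
r_partial = partial_r(x, y, g.astype(float))
print(r_w, r_partial)
```

The two values should coincide up to rounding error, because partialling out a dichotomous membership variable removes exactly the between-subgroup mean differences.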


Further Derivation of r_t


If the numerator and denominator of Equation 4 are each divided
by the product of the pooled within-subgroup standard errors of the
mean (s_{x̄w} and s_{ȳw}), a more convenient definition of r_t arises. To
complete this series of operations, by defining t_x = U/s_{x̄w} and
t_y = V/s_{ȳw} as subgroup mean differences measured in units of standard
errors, the following final form results:

$$r_t = \frac{(N - 2)\, r_w + t_x t_y}{\sqrt{\left[(N - 2) + t_x^2\right]\left[(N - 2) + t_y^2\right]}} \tag{6}$$

In this form, r_t is defined in terms of total group sample size, the
pooled within-subgroup correlation coefficient, and subgroup mean
differences that are measured in units distributed as Student's t under
the assumptions underlying t tests on means.
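Equation 6 is simple enough to evaluate directly. The sketch below, using only the Python standard library, is one possible implementation; the function name is hypothetical, and the critical value 2.3060 (two-tailed .05 level of Student's t with 8 df) is taken from standard tables rather than from the Sockloff and Edney tables cited in the paper. With N = 10 and r_w = .8000 it reproduces the first column of Table 1.

```python
import math

def total_r_eq6(N, r_w, t_x, t_y):
    """Equation 6: total-group r from N, the pooled within-subgroup r,
    and the subgroup mean differences expressed as t statistics."""
    df = N - 2
    return (df * r_w + t_x * t_y) / math.sqrt((df + t_x**2) * (df + t_y**2))

T_CRIT = 2.3060  # two-tailed .05 critical value of t for 8 df (standard tables)

pattern1 = total_r_eq6(10, 0.8, T_CRIT, T_CRIT)    # both means higher in Subgroup 2
pattern2 = total_r_eq6(10, 0.8, T_CRIT, 0.0)       # means differ on X only
pattern3 = total_r_eq6(10, 0.8, T_CRIT, -T_CRIT)   # means differ in opposite directions

print(round(pattern1, 4), round(pattern2, 4), round(pattern3, 4))
```

The three results correspond to the tabled values .8799, .6200, and .0813 for a total group sample size of 10.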


The Decision to Pool 


The following is a simple, approximate, two-stage procedure, 
recommended as a basis for the decision to pool two subgroups in 
order to calculate a single correlation coefficient for the total group 
data. Underlying both stages is the assumption that the two subgroups 
were sampled from the same bivariate normal population. 

1. The two within-subgroup correlation coefficients should be
compared. Under the hypothesis H_0: ρ_1 = ρ_2, the unit-normal z test,
relying on Fisher's r-z transformation (z_r), can be used:

$$z = \frac{z_{r_1} - z_{r_2}}{\sqrt{\dfrac{1}{n_1 - 3} + \dfrac{1}{n_2 - 3}}} \tag{7}$$

If H_0 is rejected, it is unreasonable to calculate r_w as a measure of
correlation for the two subgroups. (For the skull data of Pearson et al.
(1899), H_0 would have been rejected (z = 1.99, p < .05), and neither r_w
nor r_t would have been calculated.) If H_0 is not rejected, then r_w can be
considered a useful measure of correlation for the two subgroups, and
the second stage test should be followed.
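The first stage test can be applied to the skull data quoted in the introduction. The following standard-library sketch assumes the usual form of the Fisher r-z test for two independent correlations; the function name is illustrative.

```python
import math

def fisher_z_test(r1, n1, r2, n2):
    """Unit-normal z for H0: rho_1 = rho_2, using Fisher's r-to-z transformation."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se

# Pearson et al. (1899): 805 males (r = .0869) and 340 females (r = -.0424)
z = fisher_z_test(0.0869, 805, -0.0424, 340)
p_two_tailed = math.erfc(abs(z) / math.sqrt(2.0))
print(z, p_two_tailed)
```

The statistic comes out very close to 2.0, in line with the z = 1.99 reported in the text, and the two-tailed probability falls just below .05.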

2. In order to assess the distortion introduced by pooling the two
subgroups, r_t should be compared to r_w under the hypothesis
H_0: ρ_t = ρ_w. For this test, if r_w is considered an asymptotic
estimate of ρ_w,

[Equation 8, the unit-normal z test comparing r_t with r_w, is not legible in the source.]   (8)

If H_0 is rejected, then r_t may be considered distorted, but r_w can be
used as a measure of correlation for the two subgroups. If H_0 is not
rejected, then pooling of the data from the two subgroups in order to
calculate a total group correlation coefficient appears to be a
parsimonious and reasonable procedure.


Examples of Distortion 


According to Equation 6, distortion in r_t is affected by subgroup
centroid differences and total group sample size. Three patterns of
subgroup centroid differences are of interest. (1) If the mean of
Subgroup 2 is higher than the mean of Subgroup 1 on both variables,
the greater the difference between the subgroups on the two variables,
the more exaggerated the value of r_t in a positive direction. (2) If the
mean of Subgroup 2 is higher than the mean of Subgroup 1 on one
variable, and equal on the other variable, the greater the difference
between the two subgroups on the one variable, the closer r_t is to the
value zero. (3) If the mean of Subgroup 2 is higher than the mean of
Subgroup 1 on one variable, and lower on the other variable, the
greater the difference between the subgroups on the two variables, the
more exaggerated the value of r_t in a negative direction. Furthermore,
for constant differences between the subgroup centroids as measured
in standard errors, increasing the total group sample size serves to
minimize the effects of subgroup centroid differences, i.e., r_t
approaches r_w.

The effects of subgroup sample size discrepancy on the calculation
of r_t and r_w can be shown by reference to Equations 4 and 5.
According to Equation 5, for constant total group sample size, the larger
the discrepancy between the subgroup sample sizes, the greater the
influence of the larger subgroup in the calculation of r_w. In addition,
according to Equation 4, for constant differences between subgroup
centroids and for constant total group sample size, the larger the
discrepancy between n_1 and n_2, the smaller the effect of subgroup centroid
differences in the calculation of r_t and, thus, the more equal the values
of r_t and r_w.
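The role of the n_1 n_2 / N weight in Equation 4 can be seen with a small worked example. The numbers below are hypothetical and chosen only for convenience: pooled within-subgroup sums of squares of 100 on each variable, a pooled sum of cross products of 40 (so r_w = .40), unit mean differences on both variables, and N = 100.

```python
import math

def r_total_eq4(ss_xy_w, ss_x_w, ss_y_w, U, V, n1, n2):
    """Equation 4 evaluated from pooled within-subgroup sums and mean differences."""
    k = n1 * n2 / (n1 + n2)
    return (ss_xy_w + U * V * k) / math.sqrt((ss_x_w + U**2 * k) * (ss_y_w + V**2 * k))

# Hypothetical within-subgroup sums giving r_w = 40 / sqrt(100 * 100) = .40
SS_XY_W, SS_X_W, SS_Y_W = 40.0, 100.0, 100.0
R_W = SS_XY_W / math.sqrt(SS_X_W * SS_Y_W)

r_equal = r_total_eq4(SS_XY_W, SS_X_W, SS_Y_W, 1.0, 1.0, 50, 50)   # weight 25
r_uneven = r_total_eq4(SS_XY_W, SS_X_W, SS_Y_W, 1.0, 1.0, 90, 10)  # weight 9
r_extreme = r_total_eq4(SS_XY_W, SS_X_W, SS_Y_W, 1.0, 1.0, 99, 1)  # weight 0.99

print(R_W, r_equal, r_uneven, r_extreme)
```

As the split becomes more uneven the weight n_1 n_2 / N shrinks, and the total-group value moves from .52 back toward the within-subgroup value of .40.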

For the second stage test, in order to demonstrate the amount of
distortion introduced by pooling heterogeneous subgroups, Equation 6
was employed to calculate r_t under varying sample conditions. The
sample conditions were derived from combinations of four
magnitudes of subgroup centroid differences for the three patterns,
five total group sample sizes (N), and three values of r_w. The
magnitude of difference between subgroup means can be represented
by employing four-decimal critical values of Student's t distribution
for p < .05, p < .01, p < .001, and p < .0001, obtained for N - 2 df
from Sockloff and Edney's (1972) tables. The five total group sample
sizes were 10, 50, 100, 200, and 1000. The three values of r_w were .8000,
.4000, and 0.0000, chosen to represent high, moderate, and low
correlations, respectively.

Tables 1, 2, and 3 present calculated values of r_t derived from the
values of r_w under the varying sample conditions. Also included in
these tables are the second stage two-tailed tests of H_0: ρ_t = ρ_w to
assess the amount of distortion introduced under the conditions.
Negative values of r_w are not shown in the tables since the effects of the
three patterns for negative r_w's are opposite in sign from those shown
for positive r_w's, e.g., the amount of exaggeration in a positive
direction for a positive correlation under Pattern 1 is equal to the amount
of exaggeration in a negative direction for a negative correlation under
Pattern 3.


TABLE 1
Values of r_t Resulting from Three Patterns of Subgroup Centroid
Differences and Five Total Group Sample Sizes: r_w = .8000

               Significance levels for t distribution when subgroup
               mean differences equal critical values
Total group
sample size    p < .05       p < .01       p < .001      p < .0001

Pattern 1: X̄_2 > X̄_1, Ȳ_2 > Ȳ_1, both significant
      10       .8799         .9169         .9521         .9727*
      50       .8155         .8261         .8408         .8546
     100       .8077         .8132         .8210         .8288
     200       .8039         .8066         .8107         .8148
    1000       .8008         .8013         .8022         .8030

Pattern 2: X̄_2 > X̄_1 significant; Ȳ_2 = Ȳ_1
      10       .6200         .5156         .3914         .2953
      50       .7683         .7460         .7138         .6822
     100       .7844         .7732         .7568         .7403
     200       .7923         .7867         .7784         .7699
    1000       .7985         .7973         .7957         .7940

Pattern 3: X̄_2 > X̄_1, Ȳ_2 < Ȳ_1, both significant
      10       .0813*        [illegible]   [illegible]   [illegible]
      50       .6602*        .5654**       .4332****     .3089****
     100       .7305         .6816**       .6108***      [illegible]
     200       .7653         .7405*        .7040**       .6672****
    1000       .7931         .7881         .7806         .7729*

Note. Asterisks refer to significance levels of unit-normal z tests comparing r_t and r_w.
* p < .05.
** p < .01.
*** p < .001.
**** p < .0001.

As shown in Table 1, under Patterns 1 and 2, large differences
between the subgroup means have a small effect on the value of r_t
when r_w = .8000. Under both patterns, a total group sample size of 50
appears to be sufficient to minimize the distortion introduced by
subgroup mean differences that are significant at the .0001 level. On the
other hand, the results were quite different under Pattern 3. When the
means of the two subgroups are significantly different at the .0001
level, but in opposite directions, for a total group sample size of 50 r_t
was calculated to be .3089, which is significantly different from an r_w
of .8000. Furthermore, even for a total group sample size of 1000, the
pooling of subgroups when subgroup means differ in opposite
directions at the .0001 level produced an r_t of .7729. Although this value of
r_t is significantly different from r_w at the .05 level, one can argue that
such statistically significant differences between r_t and r_w have little
practical significance.

According to Table 2, when r_w = .4000, the results for Pattern 1
suggest that total group sample sizes of 50 are sufficient to avoid
distortions introduced by pooling subgroups when both sets of subgroup
means differ significantly in the same direction at the .0001 level.
Under Pattern 2, significant distortion was not found, even for a total
group sample size of 10. The Pattern 3 results suggest that a total
group sample size of 200 will avoid distortion in the calculation of r_t.

According to Table 3, total group sample sizes of 50 appear to be
sufficient to avoid distortion when r_w = 0.0000 and the two sets of
subgroup means differ at the .0001 level. Based on the symmetry of the
sampling distributions of r when ρ = 0, this conclusion holds for
subgroup means differing in the same or opposite directions.


TABLE 2
Values of r_t Resulting from Three Patterns of Subgroup Centroid
Differences and Five Total Group Sample Sizes: r_w = .4000

               Significance levels for t distribution when subgroup
               mean differences equal critical values
Total group
sample size    p < .05       p < .01       p < .001      p < .0001

Pattern 1: X̄_2 > X̄_1, Ȳ_2 > Ȳ_1, both significant
      10       .6396         .7508         .8564         .9182**
      50       .4466         .4782         .5223         .5637
     100       .4232         .4395         .4631         .4863
     200       .4116         .4198         .4320         .4443
    1000       .4023         .4040         .4065         .4090

Pattern 2: X̄_2 > X̄_1 significant; Ȳ_2 = Ȳ_1
      10       .3100         .2578         .1957         .1477
      50       .3842         .3730         .3569         [illegible]
     100       .3922         .3866         .3784         .3701
     200       .3961         .3933         .3892         .3850
    1000       .3992         .3987         .3978         .3970

Pattern 3: X̄_2 > X̄_1, Ȳ_2 < Ȳ_1, both significant
      10       -.1590        -.4184*       -.6648**      -.8092***
      50       .2913         .2175         .1147*        .0181**
     100       .3459         .3079         .2529         .1987*
     200       .3730         .3537         .3253         .2967
    1000       .3946         .3907         .3849         .3789

Note. Asterisks refer to significance levels of unit-normal z tests comparing r_t and r_w.
* p < .05.
** p < .01.
*** p < .001.


Discussion

The various results clearly suggest the varieties of distortion that 
may be introduced by haphazardly pooling subgroups of data for the 
purpose of calculating a single correlation coefficient. The two-stage 
test procedure should offer protection against such distortion. In 
addition, it was shown that greater latitude exists in terms of non- 
distorting pooling when the subgroup mean differences are small, the 
subgroup sample sizes are large, and the pooled within-subgroup 
correlations are low to moderate. The calculated examples suggest 
limits within which distortion does not seriously affect correlational 
results. 

The types of subgroups to which this discussion refers are those
resulting from natural dichotomies and those resulting from an
arbitrary split where (a) the decision to split the total group was based on
considerations other than that of ridding the data of nonlinearity, and
(b) middle range data has been discarded. When an arbitrary split is
made to rid the total group data of nonlinearity, the two subgroups
may also show evidence of different, but linear, relationships. According
to the first stage test, if the two within-subgroup correlations are
different, then it would appear unreasonable to even consider pooling
the data on the basis of the original rationale for having made the split.
On the other hand, if middle range data is not discarded when an
arbitrary split is made for reasons other than ridding the data of
nonlinearity, then pooling would appear to be a reasonable step toward
restoring the information contained within the total bivariate set of
data.

TABLE 3
Values of r_t Resulting from One Pattern of Subgroup Centroid
Difference and Five Total Group Sample Sizes: r_w = 0.0000

               Significance levels for t distribution when subgroup
               mean differences equal critical values
Total group
sample size    p < .05       p < .01       p < .001      p < .0001

Pattern 1: X̄_2 > X̄_1, Ȳ_2 > Ȳ_1, both significant
      10       .3993         .5846         .7606*        .8637**
      50       .0777         .1303         .2038         [illegible]
     100       .0386         .0658         .1051         [illegible]
     200       .0193         .0330         .0533         [illegible]
    1000       .0038         .0066         .0108         .0151

Note. Asterisks refer to significance levels of unit-normal z tests comparing r_t and r_w.
* p < .05.
** p < .01.


Implications of this study relate to the use of the pooled
within-subgroup correlation coefficient and to generalizations of this study in
terms of pooling multiple subgroups in the calculation of correlation
matrices. Assuming no difference between the within-subgroup
correlation coefficients (non-rejection of the first stage test), the pooled
within-subgroup correlation coefficient is useful as a descriptive
statistic with hypothesis-testing capabilities resulting from its
equivalence to a first-order partial correlation coefficient. In research
involving multiple dependent variables that are analyzed via several t
tests comparing means, rather than a one-way two cell multiple
analysis of variance, intercorrelations among the dependent variables
should be assessed through pooled within-subgroup correlation
coefficients rather than total group correlation coefficients. Otherwise, the
meaningfulness of the intercorrelations would be contingent upon the
failure to find cell differences in all of the t tests.

Considering the demonstrated varieties of possible distortion of a 
single correlation coefficient from the haphazard pooling of only two 
heterogeneous subgroups, the generalizations of these results must be 
inherently more complex, i.e., the effects of pooling multiple sub- 
groups on correlation matrices. If, indeed, such complex distortion 
can be demonstrated, and it is desirable to pool data for reasons of 
parsimony, this suggests that further study should be devoted to the 
multivariate case and the development of appropriate test procedures.


THE r-POINT BISERIAL LIMITATION


R. A. KARABINUS 
Northern Illinois University 


A study was made of the r-point biserial coefficient using four non- 
normal distributions for the continuous variable: rectangular, 
bimodal-normal, bimodal-peaked, and bimodal-peaked and skewed. 
Ns of 10, 30, and 100 were used. It was argued that linearity was the
main assumption required when using Pearson correlations and that 
the usual maximum r-point biserial of .798 could be exceeded when 
the shape of the continuous variable more nearly approached that of 
the dichotomized variable. Correlations were found over .80 with 
rectangular distributions, and over .90 with bimodal-peaked dis- 
tributions. 


IT is generally accepted by many practitioners of educational
statistical techniques that the r-point biserial correlation coefficient
has a limitation of about .80. This is the limitation that most
educational statistics books cite, and it is based on the assumption that
the Y variable (e.g., total test scores in an item analysis) is normally
distributed and the X variable (e.g., scores on dichotomous items, 1 or
0) is a true dichotomy, split in such a way that p = q = .5. Since the
r-point biserial is a member of the Pearson product moment correlation
family, the other assumptions of linearity and homoscedasticity are
also mentioned. However, a different interpretation of the normality
assumption was taken by Walker and Lev (1953), who indicated that
the assumption of normality is applicable for each set of Y scores
paired with each of the X values. This would give a bimodal
distribution on the Y variable. Since it is possible to obtain correlations above
.80 with a bimodal distribution on the Y variable (with the X variable
split so that p = q = .5), the question of the traditionally accepted
limitation and assumptions of the r-point biserial needs to be
reopened.

Even though Nefzger and Drasgow (1957) made a strong case for
not needing to meet the assumptions of normality when using any of 
the Pearson product moment correlations, most authors of beginning 
statistics books still cite the triumvirate: linearity, normality, and 
homoscedasticity. If, indeed, linearity is the chief assumption to be 
met with the Pearson coefficients, as suggested by Nefzger and 
Drasgow, then the other two conditions may accompany linearity but 
they are not necessary conditions. It is quite possible to have both 
linearity and homoscedasticity without having normal distributions, 
as long as the two variables have the same shape (both in skewness and 
kurtosis). This can be easily demonstrated by the interested reader by 
plotting a scattergram of two negatively (or positively) skewed 
variables that are moderately to highly related. 

Applying these assumptions to the fourfold point correlation
coefficient (phi), another member of the Pearson family, raises similar
concerns. Both Glass and Stanley (1970) and McNemar (1962) explained
the need to have similar distributions on the dichotomous variables in
order to obtain the best estimates of correlations. Linearity was the
main concern, and the authors showed that when linearity is violated
by having disproportionate ratios of 1's and 0's for each variable, there
is an attenuation effect on the correlation coefficient.

If normality is not a requirement for the phi coefficient, then why 
should it be for the r-point biserial? To ensure linearity, however, it is 
necessary to have the same shape for each variable, both in skewness 
and kurtosis, especially the former. In light of this, the usually cited 
limitation of .80 for the point biserial can be understood. One 
variable is continuous and normal and the other dichotomous and 
symmetrical (if p = q). When the dichotomous variable is split at the 
median, it is as “normal” as it can be, and the highest possible coeffi- 
cient that can be obtained is .798. 

If it is appropriate to use the Pearson coefficients when the variables 
are both linear and similar in shape, as suggested above, then if the 
shape of the continuous variable were made more similar to that of the 
dichotomous variable, the .798 limitation should not hold. Bowers 
(1972) found coefficients of .866 for rectangular distributions of the 
continuous variable with equal non-overlapping distributions for the 
dichotomous variable. He also found coefficients up to .849 when the
continuous variable was skewed and the dichotomous variable split at
.6 and .4. In an earlier study, Adams (1960) reported an r-point biserial
of .839 with a platykurtic distribution using an n of 512 and an equal
split on the dichotomous variable.
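The two limits quoted above, .798 for a normal continuous variable and .866 for a rectangular one, can be recovered analytically. The sketch below uses only the Python standard library and rests on a simplifying assumption that is not stated in the original: the dichotomy is formed by cutting the continuous variable itself at a single point, so that the two score distributions do not overlap. The function names are illustrative.

```python
import math
from statistics import NormalDist

def rpb_limit_normal(p):
    """Largest point-biserial r when a standard normal Y is cut at one point
    and the upper proportion p is scored 1."""
    q = 1.0 - p
    cut = NormalDist().inv_cdf(q)
    # (M1 - M0) / SD * sqrt(pq) reduces to pdf(cut) / sqrt(pq) for a unit normal
    return NormalDist().pdf(cut) / math.sqrt(p * q)

def rpb_limit_uniform(p):
    """Same quantity when Y is rectangular on (0, 1) and cut at q = 1 - p."""
    q = 1.0 - p
    # The subgroup means differ by 0.5 and the SD of Y is 1 / sqrt(12)
    return 0.5 * math.sqrt(12.0) * math.sqrt(p * q)

normal_limit = rpb_limit_normal(0.5)
uniform_limit = rpb_limit_uniform(0.5)
print(round(normal_limit, 3), round(uniform_limit, 3))
```

With an equal split the two functions give approximately .798 and .866, matching the values cited in the text for the normal and rectangular cases.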

It is clear to the author that the main reason for the limiting value of 
the r-point biserial is the lack of similarity of distribution in the two
variables. While it is recognized that it is impossible to obtain a perfect
correlation with the r-point biserial (this can occur only with two con- 
tinuous variables or two dichotomous variables), it is possible to more 
nearly approach +1.00 by making the shape of the continuous 
variable more like the shape of the dichotomous one. This can be done 
by having a bimodal but symmetrical distribution on the continuous 
variable, which for each part of the dichotomous variable would be as 
peaked as possible. 

In this study, r-point biserial coefficients were calculated using five 
different shapes of the continuous variable for comparative purposes: 


Normal: distributions were less than perfectly normal because of 
the number of scale units used (see Table 2) and the size of n. 
No fractional frequencies were used. 

Rectangular: perfectly flat distributions. The coefficients were a 
function of the number of scale units used. 

Bimodal-normal: the distribution of the continuous variable was 
made as normal as possible for each of the dichotomized values 
without using fractional frequencies. 

Bimodal-peaked: for each dichotomized value, the distribution of 
the continuous variable remained symmetrical but was made as 
peaked (leptokurtic) as possible. 

Bimodal-peaked and skewed: same as above (bimodal-peaked) ex- 
cept each of the distributions was highly negatively skewed. 


Three different n's were chosen, 10, 30, and 100 (with slight
adjustment for the rectangular distribution), with two different p values,
.5 and .6. One set of coefficients was calculated when there was no
overlap of the continuous variable from one of the dichotomized
values to the other, and another set with slight overlap (see Table 2
for the actual frequencies and overlap used). Table 1 gives the
resulting r-point biserial coefficients under the described
circumstances, and Table 2 gives the actual distributions used over each
of the dichotomized values.

The calculated coefficients shown in Table 1 clearly indicate that
r-point biserial values can be found above .798 when the shape of
the continuous variable is more like that of the dichotomized one.
(Those coefficients > .798 under the “Normal distribution” occurred
because the distributions were not perfectly normal.) In practical
situations, researchers seldom will replicate the identical conditions
assumed in this study, but they may approach them. For example,
platykurtic and occasionally bimodal distributions do occur in item
analyses of teacher-made tests. Therefore, it might be helpful to
know that the r-point biserial, as one of the Pearson product
moment correlation coefficients, has no limitation that is not present
in any of the other members of the Pearson family (except for the
natural limitation of the variables themselves). While it is not
possible to obtain the perfect +1.00, one can approach it, e.g., .978
with an n of 100, with a bimodal-peaked distribution on the
continuous variable and no overlap on the dichotomized variable.

In summary, with n’s > 30 and rectangular distributions of the 
continuous variable having little or no overlap on each of the 
dichotomized values, coefficients above .80 can be expected. Similarly, 
with bimodal distributions having little or no overlap on each of the 
dichotomized values, coefficients of .90 or above can be expected.


ON SPLITTING THE TAILS UNEQUALLY:
A NEW PERSPECTIVE ON ONE-VERSUS 
TWO-TAILED TESTS 


SANFORD L. BRAVER' 
Arizona State University 


The controversy that has raged since the early fifties regarding the
admissibility of one-tailed tests of hypotheses was examined. From
the review of that literature, it was concluded that the main
advantage of the one-tailed test was the gain in power for the prediction
while its main disadvantage was its inability to test for significance if
the results were opposite to prediction. It is argued here that splitting
α unequally between the two tails, placing most of the rejection
region on the side of the prediction but a smaller fraction on the
opposite side, provides both power and the ability to detect
opposite-to-prediction outcomes. This compromise procedure requires a finer
choice in the splitting of α than the dichotomous choice of putting
either all or exactly half of α in the favored tail, i.e., the choice
between a one- or a two-tailed test. Rules for the most effective split,
based on Bayesian considerations, are prescribed. The fraction of α
in the predicted tail should be equal to the investigator's a priori
probability that the predicted order, as opposed to the reversed
order, of sample means will be obtained. A table of t-values is
presented which gives critical regions for significance, both
“expected” and “unexpected,” at specified levels of a priori probability.


SINCE the early fifties lengthy debates regarding the propriety of one-
versus two-tailed tests of hypotheses have appeared in the literature.
The sole area of agreement appears to be that the two-tailed test is
appropriate when the investigator is concerned merely with the presence
or absence of the effect of some two-level independent variable on the
dependent variable,² without being concerned, or hypothesizing in
advance, as to the direction of difference. In this test, the null is that the
two population means are identical (H₀: μ₁ − μ₂ = 0), or more
precisely, that the two samples are drawn from a single population. An
improbably large difference in the two sample means, regardless of
which is the greater, is taken as evidence opposed to this null and in
favor of its alternative, Hₐ: μ₁ − μ₂ ≠ 0. In order to assure that the
probability of a Type I error (rejecting a true null hypothesis) is held at
some fixed level, α, we divide α in half. Thus, .5α proportion of the
area on either extreme (or tail) of the sampling distribution of mean
differences is taken as the "rejection of null" region.

Many authorities (Marks, 1951, 1953; Jones, 1952, 1954) argued
that a different kind of logic is applicable when the investigator
hypothesizes in advance which of the two means should be the larger,
by far the most usual occurrence in psychological research. Here, the
null is termed directional, and takes the form H₀: μ₁ − μ₂ ≤ 0, while
the alternative becomes Hₐ: μ₁ − μ₂ > 0. Again, only evidence in favor
of the alternative hypothesis, that is, finding that the obtained order of
means matches the order predicted by Hₐ, can be used to reject the
null. Thus, the entire α, rather than half of it, is placed on the side of
the null-generated sampling distribution corresponding to the ordering
of means predicted by Hₐ, and becomes the rejection region. This
one-tailed test procedure has the effect of requiring less extreme mean
differences in the direction predicted than does the two-tailed test to
reject the null with probability α of a Type I error. The latter is referred
to as a gain in power.

The advice to use one-tailed tests is not without severe critics, 
however. In particular, Hick (1952) and Burke (1953, 1954) have 
joined the fray opposing this use. Their most compelling point is that 
this procedure prevents the investigator from evaluating for signifi- 
cance differences in the unexpected direction, no matter how large. 
Inasmuch as the insights to be gained from clear-cut results which are
counter-intuitive or counter-theoretical sometimes exceed in importance
those from findings that are consistent with predictions, this
shortcoming is crucial.


² The argument being developed is cast in terms of mean differences or t-tests, but, of
course, is equally applicable to any potentially two-directional test. Thus the argument
for tests of central tendency using non-parametric indices (e.g., Mann-Whitney U,
median split, etc.), tests of correlation coefficients different from zero, tests of differences in
correlation coefficients, tests of chi-square with two levels on the predictor variables,
and even F-tests in analysis of variance would all be exactly analogous. In the latter case,
in a factorial (or one-way) design, any 1 df test or contrast may be converted to one
sensitive to direction by square-rooting the obtained F, which is then equivalent to a
t-statistic with df equal to the df of the F denominator.
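
The conversion described in this footnote can be checked numerically. The
following is a minimal sketch for the simplest case (two groups, one-way
design); the function names and the sample data are illustrative assumptions,
not part of the original article. It computes the pooled-variance t and the
one-way F on the same data and confirms that the square root of F, given the
sign of the mean difference, reproduces t.

```python
import math


def pooled_t(x, y):
    """Pooled-variance t statistic and its df for two independent groups."""
    n1, n2 = len(x), len(y)
    m1, m2 = sum(x) / n1, sum(y) / n2
    ss1 = sum((v - m1) ** 2 for v in x)
    ss2 = sum((v - m2) ** 2 for v in y)
    df = n1 + n2 - 2
    pooled_var = (ss1 + ss2) / df
    t = (m1 - m2) / math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return t, df


def one_way_f(groups):
    """One-way ANOVA F with its numerator and denominator df."""
    values = [v for g in groups for v in g]
    grand_mean = sum(values) / len(values)
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    ss_within = sum((v - m) ** 2 for g, m in zip(groups, means) for v in g)
    df_between = len(groups) - 1
    df_within = len(values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within), df_between, df_within


def signed_root_f(f_value, mean_difference):
    """Direction-sensitive statistic: sqrt(F) carrying the sign of the difference."""
    return math.copysign(math.sqrt(f_value), mean_difference)


# Hypothetical data, for illustration only.
group_1 = [12, 15, 11, 14]
group_2 = [9, 10, 13, 8]
t_value, t_df = pooled_t(group_1, group_2)
f_value, _, f_df = one_way_f([group_1, group_2])
diff = sum(group_1) / len(group_1) - sum(group_2) / len(group_2)
print(t_value, signed_root_f(f_value, diff), t_df, f_df)
```

The signed root can then be referred to whatever critical values, equal or
unequal in the two tails, the investigator has chosen in advance.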

Kimmel (1957) took a slightly more moderate position. He wished
to "limit the use of one-tailed tests to . . . infrequent situations" (p. 
353) rather than to outlaw them entirely. In developing his criteria for 
their authorized use, however, he is clearly in agreement with Hick and 
Burke that most unexpected findings must be testable for significance. 
Thus these criteria essentially require the investigator to perform a 
two-tailed test except when differences in the opposite direction are 
impossible or meaningless.

Hence, the main disadvantage of the one-tailed test, as compared to 
the two-tailed, its inability to deal responsibly with results in the other 
direction from that hypothesized, is seen by its critics as fatal. 
However, the main advantage of the one-tailed test, additional power, 
is seen by its proponents as overriding this disadvantage. Fisher, in 
defining power as the probability of detecting true population 
differences, used the term synonymously with "precision." Further- 
more, Overall (1969) proved that increasing power has the additional 
advantage of decreasing the proportion of significant results which are 
due to Type I error. Despite its obvious importance, power is a con- 
sideration which has been too often either neglected or misunder- 
stood. Cohen (1962) has shown that the level of power attained by 
studies in the social-personality area is unacceptably low. And even 
sophisticated researchers have misguided faith in the ability of well- 
planned studies to surmount deficiencies of power, as Tversky and 
Kahneman (1971) have demonstrated.

The present argument accepts the rationality inherent in both posi- 
tions and urges a compromise. Rather than taking a stance with regard 
to whether the one- or the two-tailed test is the most seriously flawed, a 
procedure is developed which can capitalize on the advantages of each. 

The procedure that can mediate this compromise is the unequal
splitting of α between the two tails; for example, .8α can be placed in
the tail which represents the predicted direction, and .2α in the other.
By splitting α unequally, the investigator is not forced to make a
binary choice between the traditional two- or traditional one-tailed
tests. Indeed, it is difficult to formulate a rationale for why these
should be the only options available. By splitting α unequally the
investigator is instead free to choose from a continuous range of
possibilities that split of α which is optimum for his situation. This range
extends from (and includes) .5α in each tail, to all of α in the favored
tail, thus including the traditional tests as the boundary cases. In making
this more flexible choice, the investigator realizes simultaneously
the advantages of both the traditional tests. The researcher gains
power on the side where he wants it, but he also retains the ability to
detect unexpected results. The latter is purchased at the cost of some of
the former, but the investigator decides in advance the best balance for
his particular research.
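
The trade-off just described can be illustrated numerically. The following is
a rough sketch, not part of the original procedure: it assumes a normally
distributed test statistic with unit variance (a z-test standing in for the
t-test discussed in the text), and the function name `split_tail_power` and
the parameter `delta` (the true standardized mean difference) are
illustrative assumptions.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal, used here in place of t


def split_tail_power(alpha, fraction_predicted, delta):
    """Rejection probabilities in each tail for an unequal split of alpha.

    alpha              -- overall Type I error rate
    fraction_predicted -- share of alpha placed in the predicted tail (.5 to 1)
    delta              -- true standardized difference; positive values mean
                          the prediction is correct, negative values a reversal
    Returns (P(reject in predicted tail), P(reject in reversed tail)).
    """
    alpha_predicted = alpha * fraction_predicted
    alpha_reversed = alpha - alpha_predicted
    # The statistic is distributed N(delta, 1); compare it with each cutoff.
    z_predicted = STD_NORMAL.inv_cdf(1.0 - alpha_predicted)
    p_predicted = 1.0 - STD_NORMAL.cdf(z_predicted - delta)
    if alpha_reversed > 0.0:
        z_reversed = STD_NORMAL.inv_cdf(1.0 - alpha_reversed)
        p_reversed = STD_NORMAL.cdf(-z_reversed - delta)
    else:
        p_reversed = 0.0  # one-tailed test: no rejection region on that side
    return p_predicted, p_reversed


# Two-tailed (.5), an 80-20 split, and one-tailed (1.0), prediction correct.
for fraction in (0.5, 0.8, 1.0):
    print(fraction, split_tail_power(0.05, fraction, delta=2.0))
```

Under these assumptions the 80-20 split recovers part of the power lost by
the two-tailed test while, unlike the one-tailed test, keeping a nonzero
chance of detecting a true reversal (a negative `delta`).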

By dividing α a priori into two unequal parts and assigning the
greater fraction to the expected tail, the investigator avoids a difficult
and embarrassing dilemma that frequently arises. This dilemma occurs
when he is faced with unexpected results which would have been
significant by a traditional two-tailed test, after having decided upon a
one-tailed test. Goldfried (1959) noted that the investigator in this
situation has available to him three options: (1) He can ignore the
result by considering it part of the null hypothesis of no difference; (2)
he can test the significance of the unexpected result with the same data,
but this kind of "cheating" has the effect of inflating α; or (3) he can
gather new data, testing the reliability of the unexpected finding with a
reversed one-tailed test. All of these options have some obvious
disadvantages. The proposed procedure minimizes or avoids these
disadvantages. First, the investigator states in advance which region of the
unexpected tail will constitute results he will consider synonymous with the
conclusion of "no difference except that due to sampling fluctuation,"
and which will fit with the conclusion of "results significantly opposite
to expected." Secondly, the two rejection regions, since they sum to α,
keep α constant at its preordained level. Finally, both hypotheses (the
expected one and the unexpected one) can be tested with the same
data.
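
That the overall Type I error rate is left unchanged can be written out
explicitly. The notation below is introduced here only for illustration and
does not appear in the original: p is the fraction of α assigned to the
predicted tail, and t₊ and t₋ are the positive critical values that cut off
upper-tail areas pα and (1 − p)α of the null sampling distribution.

```latex
\begin{align*}
P(\text{reject } H_0 \mid H_0)
  &= P(t \ge t_{+} \mid H_0) + P(t \le -t_{-} \mid H_0) \\
  &= p\,\alpha + (1 - p)\,\alpha \\
  &= \alpha , \qquad .5 \le p \le 1 .
\end{align*}
```

Setting p = .5 gives the traditional two-tailed test, and p = 1 the
traditional one-tailed test.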


Deciding on Fractions 


If the unequal splitting of α is to be a viable option for the
researcher, it is necessary to develop a rational basis for deciding upon
the split of α into fractions.

The first consideration is that the choice should obviously be made
in advance of seeing the data. Though Hick (1952) argued that "it
makes no difference when a theory or hypothesis is conceived . . .
Logic is timeless" (p. 316), Marks (1953) correctly responds to this by
noting that how it is conceived (from inspection of the set of data being
analyzed or from theoretical considerations) is, however, of
statistical importance.


³ A clever approach to directional inference, suggested by Shaffer (1972), is to test
simultaneously two sets of nulls and alternatives: 1H₀: μ₁ ≥ μ₂ vs. 1Hₐ: μ₁ < μ₂, and
2H₀: μ₁ ≤ μ₂ vs. 2Hₐ: μ₁ > μ₂. If the investigator tests each null at .5α, the maximum
probability of at least one false rejection is α. This approach, too, is amenable to the
suggestions made here. If the experimental prediction favored by the investigator is 1Hₐ, he
should inflate the α for the test of 1H₀, and correspondingly decrease it for the test of
2H₀, such that the sum of the two α's equals the predetermined overall α.

The second issue concerns the belief structure or conclusions drawn
by the investigator, which, of course, materially affect his subsequent 
research activities and thus determine the quantity and nature of 
evidence that will be obtained in the future. With the present two- 
tailed test, a statistically significant finding that is directionally op- 
posite to that which was theoretically predicted, does not appear to 
have the same implications—in terms of these conclusions or future 
research activities—as “expected significance.” It is true that opposite- 
to-predicted significant results are likely to weaken the investigator's 
belief in his hypothesis while expected significant results will 
strengthen this belief. However, it is also true that the former is un- 
likely to strengthen his belief in the opposite hypothesis as much as the 
latter will strengthen his belief in the proposed hypothesis. This can be 
demonstrated by observing that investigators more often attempt rep- 
lication of a study when the results are significant in the opposite direc- 
tion than when they are significant in the direction predicted. What is 
empirically the case corresponds to what statistical decision (i.e.,
Bayesian) theory prescribes as optimum for the decision-making 
researcher. 

The recommended flexible division of α would enable us to deal
with the asymmetry of the impact of expected vs. unexpected
significance. The split of α should be arranged in such a way that the
investigator will be as convinced by "unexpected significance" as by
"expected significance." That is, the split must be arranged so that if a test
statistic just falls in the smaller of the rejection regions the
investigator's belief in the validity of the reversed hypothesis should
equal his belief in the predicted hypothesis if the test statistic had just
fallen into the larger of the rejection regions. To clarify, and following
Kaiser (1960), we divide the usual two hypotheses, null and
alternative, into three which are again mutually exclusive and exhaustive:
H₁: μ₁ < μ₂; H₂: μ₁ = μ₂; and H₃: μ₁ > μ₂. Suppose that H₃ is the
hypothesis favored by the theory, and the investigator had decided
upon the classical two-tailed test, i.e., each tail contains .5α. It is
unlikely that a t-value which just falls into the H₁ rejection area will
confirm the investigator's belief in the validity of H₁ to the extent that a
t-value which barely crosses the H₃ rejection region does so for H₃. He is
much more likely to attempt to explain the unexpected, H₁-consistent,
results than the expected, H₃-consistent, results as chance deviation
due to sampling (i.e., Type I) or measurement error. In contrast, he
will typically accept the latter as confirmation. If one simply will not
be convinced by an opposite-to-predicted result without further
replication, he is in essence throwing away up to half of his α if he
places it in that tail. In terms of the present argument for keeping the
implications of achieving significance consistent for the researcher's
conclusions, the 50-50 split is unwise. However, some other split of α
can be found where the investigator's belief changes identically for
results in either rejection region. The task is to adjust the size of these
two rejection regions in such a way as to equalize the investigator's
temptation to regard results in either of these regions as arising from
Type I error. This can be done by entertaining as possibilities
various splits and with each asking the question: "If the t-value falls
into the H₁ rejection area will my conviction in the validity of H₁ be
equal to my conviction in the validity of H₃ if the value falls into its
rejection region?" If so, the correct split has been made; if not, continual
readjustments of the split are made until the answer is affirmative.

While the above procedure is tedious, another can be found which
will have the same desired effect, i.e., of producing identical changes
in belief for results in either rejection region. This more convenient
procedure makes use of a version of Bayes' formula to the effect that
proportionally greater evidence should be required to convince the
rational decision-maker of the validity of a proposition he considers
improbable than one which has high a priori probability. Thus if the
a priori probability of H₃ is four times that of H₁, the evidence for H₁ in
order to accept H₁ should be required to be four times as compelling as
the evidence for H₃ in order to accept H₃. Thus the investigator should
compare H₃ to H₁ for a priori probability.⁴ The fraction of α placed in
the larger tail should equal the a priori probability of H₃.

Yet another mathematical relationship can be utilized to make the
procedure still more concrete. The investigator's a priori probability of
H₃ should be equivalent to his judgment of the likelihood that the
sample means will be in the order predicted by the theory rather than the
reverse direction, ignoring for the moment the issue of significance. If,
for example, the researcher believes, on the basis of previous
accumulated evidence and all else that he knows of the situation, that X̄₁
is four times as likely to exceed X̄₂ as the reverse, i.e., his personal
probability that X̄₁ > X̄₂ equals .8, he should split α so that .8α lies in
the H₃ tail, .2α in the H₁ tail.

To ascertain whether .8 actually is his a priori probability he should
ask himself whether he would be willing to accept either side of a
money bet that X̄₁ will exceed X̄₂, as opposed to X̄₁ < X̄₂, at 4 to 1 odds.⁵
The basic rule becomes: the proportion of α allocated to the favored
tail should equal the investigator's a priori probability that the sample
means will be ordered as predicted, for then he will be equally tempted
to regard results in either rejection region as arising from Type I error.

⁴ This assumes that the a priori probability of H₂ (that μ₁ equals exactly μ₂) is zero, a
realistic assumption. See Bakan (1966), Meehl (1967).

⁵ To offer a formula, and an additional example: if the investigator would give A to B
odds, where A is the larger, since his a priori probability is A/(A + B), he should put
(A/(A + B))α in the larger tail, (B/(A + B))α in the smaller. With odds of 3 to 2,
A = 3, B = 2 and probability = .6.
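
The basic rule and the odds formula can be sketched in a few lines. The
function name `split_alpha` and its argument names are illustrative
assumptions; the arithmetic is that of the rule as stated (a priori
probability A/(A + B), with that fraction of α placed in the favored tail).

```python
def split_alpha(alpha, odds_for, odds_against):
    """Split alpha between the predicted and reversed tails from betting odds.

    odds_for : odds_against are the odds the investigator would give that the
    sample means fall in the predicted order (for example, 4 to 1).
    Returns (a_priori_probability, alpha_predicted_tail, alpha_reversed_tail).
    """
    if odds_for <= 0 or odds_against < 0:
        raise ValueError("odds must be positive (odds_against may be zero)")
    if odds_for < odds_against:
        raise ValueError("the predicted direction should carry the larger odds")
    prior = odds_for / (odds_for + odds_against)
    return prior, prior * alpha, (1.0 - prior) * alpha


# The two cases worked in the text: odds of 4 to 1 and of 3 to 2.
print(split_alpha(0.05, 4, 1))
print(split_alpha(0.05, 3, 2))
```

Up to floating-point rounding, odds of 4 to 1 give a probability of .8 with
.04 and .01 in the two tails, and odds of 3 to 2 give .6 with .03 and .02.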

Traditional one- and two-tailed tests emerge as special cases of this
rule. A one-tailed test is performed when the a priori probability of
reversal of means is 0, that is, when it is seen as impossible (cf.
Kimmel, 1957). A two-tailed test (50-50 split) is performed when the
researcher has no a priori evidence which leads him to expect that X̄₁ is
more likely to exceed X̄₂ than the reverse.

Table 1 presents critical values for the t-statistic, with α = .05, in
which α is split in accordance with the a priori probability of the
predicted direction of sample means. Two values are given for each
combination of df with a priori probability. The "+" value
corresponds to the t-value needed to reject the H₂ null in favor of the
predicted direction, while the "-" value presents the value necessary
to reject the H₂ null in favor of the reversed direction hypothesis. The
conditional probability, given H₂, that t will exceed the "+" value
or be less than the "-" value, equals .05. When the a priori
probability is 1.0, the "+" values correspond to those needed for traditional
one-tailed tests, while the "-" values are infinite. When the a priori
probability is .5, the values for both "+" and "-" outcomes are
identical to two-tailed test values. As another example using the case given
earlier, where the odds were 4 to 1, the a priori probability would
equal .8. Thus the "+" value corresponds to .8α or .04, while the "-"
value is for .2α or .01. Given df of, for example, 21, if the sample
means are ordered as predicted, an obtained value for t would have to
equal or exceed 1.840 to claim "expected significance." The critical
value for "unexpected significance," for the case when the order is
reversed, is 2.518. By adopting these critical values, expected and
unexpected significance should have equal implications for the
researcher's a posteriori beliefs.
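
Critical values of the kind tabled can also be computed directly. The sketch
below is one possible approach, not the method used to construct Table 1: it
integrates the t density numerically (Simpson's rule) and finds each cutoff
by bisection, using only the Python standard library. All function names are
illustrative assumptions, and the numerical accuracy is adequate for moderate
df but degrades for very small df combined with very small tail areas.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=1000):
    """P(T > t) for t >= 0, by Simpson's rule on the density over [0, t]."""
    if t == 0:
        return 0.5
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(tail_area, df):
    """Positive t value that cuts off `tail_area` in the upper tail."""
    if tail_area <= 0:
        return math.inf  # no rejection region in this tail
    lo, hi = 0.0, 1.0
    while t_upper_tail(hi, df) > tail_area:  # expand until the root is bracketed
        lo, hi = hi, hi * 2
    for _ in range(60):  # bisection
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > tail_area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def split_critical_values(prior, df, alpha=0.05):
    """Return (t_plus, t_minus): cutoffs for expected and unexpected significance.

    prior is the a priori probability of the predicted order of sample means;
    prior * alpha goes in the predicted tail, the remainder in the other.
    """
    return t_critical(prior * alpha, df), t_critical((1 - prior) * alpha, df)


print(split_critical_values(0.8, 21))
```

For an a priori probability of .8 and df = 21 this should agree, to rounding,
with the values 1.840 and 2.518 quoted above; a reversed result is then
declared significant when the obtained t is at or below the negative of the
second value.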


Discussion 


Splitting the tails unequally on the basis of the a priori probability
that the predicted direction will obtain for the sample means calls into
focus a point which remained obscured utilizing the traditional
procedures. When a prediction of an experimental outcome is made, it
follows from a combination of intuitive notions, informal observations,
empirically untested theory, empirically tested theory, previous
tangential research with varying outcomes, previous research with the
present paradigm with varying outcomes, etc. Another consideration is
the skill of the investigator in operationalizing his concepts and
providing a fair test. However, no two experimental predictions combine
these ingredients in precisely the same way. Some predictions are made 
with extreme confidence and vigor, and others with great timidity. To 
permit a choice only between a one- and a two-tailed test is to disallow 
finer distinctions in the strength or firmness of predictions made in 
different studies. The continuous choice of splits is superior, not only 
because of salutary effects on the conduct of research, but also because 
it allows the researcher to summarize in a single index, i.e., the a priori 
probability statement, all the considerations which led him to his pre- 
diction. While an objective formula for combining the considerations 
into a probability statement would, of course, be preferable to the sub- 
jective operation advocated here, it is unfortunately unrealistic. 

It should be emphasized, however, that the personal probability 
statement should be a public one which the researcher can defend by 
reference to previous literature, etc. Indeed, the typical introductory 
section of most research reports is a listing of the considerations which 
led the researcher to his prediction, and could readily and logically
culminate in the statement of the a priori probability used in the test of 
the hypothesis. Readers (or editors) whose personal probability 
differed from the investigator's would be thus free to draw different 
implications from the evidence. 

One apparent drawback to the proposed procedure is its
vulnerability to abuse. Suppose an investigator, in advance of seeing
the data, decided upon a 60-40 split, i.e., his subjective probability was
.6 that X̄₁ would exceed X̄₂. Assume, with df = 16, his obtained t-value
was +1.90. From the table he observes that he has not obtained
significance in the expected direction. If he had instead chosen .8 as his a
priori probability it is clear that significance would have been achieved.
What is to prevent him from being seduced into discarding his earlier
split and deciding that confidence of .8 had indeed been warranted? On
reflection it can be recognized that the same argument applies with
equal force to switching from a two- to a one-tailed test to achieve
significance. As in the latter we must rely on the integrity of the
investigator.

There is no doubt that the compromise procedure of splitting the 
tails unequally will offend some traditionalists, especially hardcore ad- 
vocates of two-tailed tests. Clearly this compromise is preferable to the 
overuse of the one-tailed test. Nonetheless, a basic objection might be 
that such a procedure formalizes and thus sanctions the custom of giv- 
ing less credence to disconfirming than confirming evidence, a practice 
which could be viewed as scientifically corrupt. It is perhaps a suffi- 
cient defense of this compromise procedure to argue that the above 
characterizes the way all-too-human investigators presently act 
anyhow, however much we might prefer some more ideal behavior.
Science progresses by what scientists do, not by what we would like 
them to do. The procedure here advocated provides a set of opera- 
tions, and the rationale, to do what is already done, but in a far more 
systematic and orderly fashion. 

A stronger and less cynical defense of the procedure, however, can 
be offered. We have been referring to the descriptive aspects of 
statistical decision theory: its (accurate) predictions about how scien- 
tists will draw conclusions given evidence and prior probability. 
However, statistical decision theory is prescriptive as well as
descriptive. In addition to making predictions, it indicates how individuals
facing uncertainty about true states of affairs should decide in order to 
maximize outcomes and minimize errors. According to Bayesian for- 
mulations, it is eminently rational to require more compelling evidence 
for low probability hypotheses than for those with high probability. In 
this way, the decision-maker is formally entitled (indeed required) to 
make use of previous knowledge, in drawing present conclusions. 
With classical procedures, by contrast, every conclusion is drawn as if 
in a vacuum, isolated from findings that have come before. It is true 
that the investigator typically endeavors to integrate his findings with 
prior evidence, but this process is informal, divorced from the
statistical testing of hypotheses, rather than an integral part of it, as
Bayesians would prefer. 

Clearly, scientists are in the business of deciding about (rather than 
determining unequivocally) the validity of theories and hypotheses 
from accumulated evidence. Yet, as decision-makers, we have not 
taken advantage of the advances being made by mathematicians 
studying statistical decision theory. The proposed procedure
incorporates their formulations and thus places our decisions on a more
rational basis than the present.

An additional consideration is the unequal stature given by
traditional procedures to Type I versus Type II (failing to reject a false
null) errors. There is little philosophical basis to the argument that to
embrace false statements is a more grievous error than failing to
recognize the validity of true statements. While splitting the tails unequally
would not go as far to restore the balance as the Bayesians might
prefer (they would wish us to specify quantitatively a different loss
function for each problem), it is preferable on this dimension to
classical procedures.

The procedure described here can be seen as a compromise not only 
between the use of the one- and the two-tailed tests. It is also a
compromise between those who wish to maintain statements of
significance and those who wish to abolish them. Bayesians, such as
Edwards, Lindman and Savage (1963) contend that beliefs change
with each new datum, nonsignificant as well as significant. Similarly, 
Eysenck (1960) argues that the p-value for the test statistic has the last- 
ing importance; whether or not p exceeds some arbitrary (.05) value 
and is called "significant" is irrelevant and has undesirable effects on 
the conduct of research. While the author has sympathy with these 
views, it is evident that their orientation has won few adherents among
psychologists actively engaged in research. Manuscripts which contain
statements of whether or not a finding is significant, without exact
p-values, remain de rigueur for most researchers and editors. Perhaps the
proposed procedure, which retains the arbitrary a level overall, but 
splits it between the two tails in accordance with a priori probability 
values, will be viewed as a realistic compromise. 

A NON-PARAMETRIC TEST FOR INCREASING TREND


In this article, a new test for increasing or decreasing trend in a set
of observations is presented. Rather than making specific paired 
comparisons, as in the Cox and Stuart tests for trend, the user of the 
new test makes comparisons of all possible pairs of observations. 
The advantage of this procedure is that, for a given number of obser- 
vations, many paired-comparisons are made. This allows the 
researcher to examine increasing or decreasing trend when there is 
quite a small number of observations although, admittedly, the 
power of the test is low in such cases. A normal approximation is 
available for use with large samples. 


PROBABLY the best known non-parametric tests for increasing or 
decreasing trend are those developed by Cox and Stuart. Of these, 
two are based on the binomial distribution. The other, which is for use 
with large samples only, used the normal deviate as the test statistic. 

In each of the Cox-Stuart tests, the set of scores being tested for
increasing or decreasing trend is divided into segments, corresponding
scores in two of these segments being compared for relative
magnitude. For the $S_2$ test, the first half and the second half of the set
of scores form the segments of interest, the middle score being dropped
if there is an odd number of scores. For N scores, therefore, there
are N/2 comparisons at most. The $S_3$ test uses the first and last thirds
of the set of scores as the segments of interest, mid-scores being dropped
(or dummy ones added) to make the effective number of scores
equal to a multiple of 3. For this test, there are approximately N/3
comparisons.

The first and last halves of the set of scores, which are used for the $S_2$
test, are again used for the $S_1$ test. For the latter test, however, the
elements of the second segment are reversed in order.

Each of the Cox-Stuart tests thus involves the comparison of a series
of pairs of corresponding scores. For each such pair, an indicator
variable takes the value of unity if the later-occurring score is greater
than the earlier-occurring, or zero if the opposite is true. The sum of
the indicator values is the test statistic for the $S_2$ and the $S_3$ tests. The
sum of weighted indicator values is the test statistic for the $S_1$ test.

The purpose of the present article is to describe a new
non-parametric test for increasing or decreasing trend. In this test, each
score in a series is compared with each score which follows it. For a
sample of N scores, therefore, $\binom{N}{2} = N(N-1)/2$ comparisons are
made. In general, this is a considerably larger number of comparisons
than would be made with any of the Cox-Stuart tests, a particularly
important aspect when N is small. Since we are concerned with whether
one score is greater than another, and not with the actual difference
between them, scores may be discarded in favour of ranks.

Given the set of scores $\{X_1, X_2, \ldots, X_N\}$, the problem is to test the
null hypothesis that there is no monotonic increase in the set against
the alternative hypothesis that monotonic increase exists. The ranks of
the scores will be represented by
$\{r\} = \{r_1, r_2, \ldots, r_i, \ldots, r_j, \ldots, r_N\}$.
Each pair of scores $(X_i, X_j)$, with $i < j$, is compared and an indicator
variable $a_{ij}$ takes the values

$$
a_{ij} =
\begin{cases}
1 & \text{if } r_i - r_j < 0 \\
0 & \text{if } r_i - r_j \ge 0
\end{cases}
$$

It is observed that the inclusion of tied observations in the category
for which $a_{ij} = 0$ will make the test more conservative. For each pair
of scores, $a_{ij}$ is a Bernoulli variable. The quantity $\sum a_{ij}$, summation
being taken over all pairs, is thus binomially distributed, the parameters
being $\binom{N}{2}$ and $\tfrac{1}{2}$. The test involves the determination of whether
the probability of the obtained value of $S = \sum a_{ij}$, and of all values
which are more extreme, is greater than α, the selected level of
significance.


Numerical Example 


Is the set of scores (11, 9, 14, 15, 13, 20) monotonically increasing? 
The corresponding set of ranks is (2, 1, 4, 5, 3, 6). Calculation of the 
values of the indicator variable $a_{ij}$ is performed below:

    r_i    r_j    a_ij
     2      1      0
     2      4      1
     2      5      1
     2      3      1
     2      6      1
     1      4      1
     1      5      1
     1      3      1
     1      6      1
     4      5      1
     4      3      0
     4      6      1
     5      3      0
     5      6      1
     3      6      1

The sum of the values taken by the indicator variable is

$$
S = \sum a_{ij} = 12
$$

The relevant binomial distribution is $B(15, \tfrac{1}{2})$, the first few terms of
which are

     S     Probability
    15       0.000
    14       0.000
    13       0.003
    12       0.014      (sum for S = 12 to 15: 0.017)
    11       0.042      (sum for S = 11 to 15: 0.059)
    10       0.092

The probability of S = 12, or of any S having a more extreme value,
is 0.017. The null hypothesis is therefore rejected at the 5% level of
significance. The set of ranks is thus still sufficiently close to perfect
ascending order to be regarded as monotonically increasing.
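
The numerical example can be reproduced with a short program. This is a
minimal sketch that takes the binomial model described in the article,
$B(\binom{N}{2}, \tfrac{1}{2})$, as given; the function names are illustrative
assumptions. Raw scores are compared directly, which yields the same
indicator values as comparing their ranks.

```python
from math import comb


def trend_statistic(scores):
    """Return (S, number_of_pairs) for the all-pairs trend test.

    S counts the pairs (i, j), i < j, in which the later score is strictly
    greater than the earlier one; ties contribute 0, as in the text.
    """
    n = len(scores)
    s = sum(1 for i in range(n) for j in range(i + 1, n)
            if scores[j] > scores[i])
    return s, comb(n, 2)


def binomial_upper_tail(s, n_pairs, p=0.5):
    """P(X >= s) for X distributed as B(n_pairs, p)."""
    return sum(comb(n_pairs, k) * p ** k * (1 - p) ** (n_pairs - k)
               for k in range(s, n_pairs + 1))


scores = [11, 9, 14, 15, 13, 20]
s, n_pairs = trend_statistic(scores)
print(s, n_pairs, binomial_upper_tail(s, n_pairs))
```

For the scores above this gives S = 12 over 15 pairs and an upper-tail
probability of 576/32768, about 0.0176, which the text reports as 0.017.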


Large Samples 


The expectation of a random variable X having an associated
binomial probability distribution B(n, p) is np; the variance of X is
np(1 − p). As the value of n increases, the distribution B(n, p)
approaches the normal distribution with mean E(X) and variance var(X).

In cases where large numbers of pairs of scores are considered,
therefore, a normal deviate may be used as the test statistic. Assuming,
for illustrative purposes, that our numerical example constitutes a
"large sample," the normal deviate is

$$
z = \frac{S - E(S)}{\sqrt{\operatorname{var}(S)}}
  = \frac{12 - (15)(\tfrac{1}{2})}{\sqrt{(15)(\tfrac{1}{4})}}
  = 2.32
$$

This value of z is greater than 1.64, and is thus significant at the 5%
level.
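
The large-sample calculation can be sketched in the same way. The function
names below are illustrative assumptions, and, following the worked example,
no continuity correction is applied.

```python
from math import sqrt
from statistics import NormalDist


def trend_normal_deviate(s, n_pairs, p=0.5):
    """Normal deviate for S under the binomial model B(n_pairs, p)."""
    mean = n_pairs * p                 # E(S) = np
    variance = n_pairs * p * (1 - p)   # var(S) = np(1 - p)
    return (s - mean) / sqrt(variance)


def upper_tail_probability(z):
    """One-sided probability of a standard normal deviate at least as large as z."""
    return 1.0 - NormalDist().cdf(z)


z = trend_normal_deviate(12, 15)
print(round(z, 2), upper_tail_probability(z))
```

With S = 12 and 15 pairs the deviate rounds to 2.32, matching the value
above, and exceeds the one-sided 5% cutoff of 1.64.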


Conclusion 


In conclusion, it is admitted that the comparison of large numbers 
of paired ranks is more tedious and time-consuming than the ex- 
amination of only a few pairs. However, in an era when computers are 
readily available to most people who need them, this is not a serious 
drawback. On the other hand, the number of paired ranks greatly ex- 
ceeds the number of observations. This not only allows us to place 
greater confidence in the data, there being more of them; it also allows 
us to use the test in cases when N is very small.


THE VARIANCES OF EMPIRICALLY DERIVED
OPTION SCORING WEIGHTS¹


GARY ECHTERNACHT 
Educational Testing Service 


Estimates for the variances of empirically determined scoring 
weights are given. It is also shown that test item writers should write 
distractors that discriminate on the criterion variable when this type 
of scoring is used. 


IN recent years, the developers of large-scale testing operations have 
shown an increasing interest in reducing the length of time examinees 
are required to spend on a given test. Reducing the test administration 
time would both reduce the cost of developing the test forms, as fewer 
items would be required, and allow time for additional tests to be ad- 
ministered. This thinking has characterized many of the test programs 
administered at Educational Testing Service, and, most likely, at other 
testing establishments. Researchers have thus sought new scoring 
methods that would result in increases in reliability due solely to the 
scoring system used. Thus, test length could be reduced, and a
previous standard of reliability maintained.

One such scoring method that has proven successful in reliability 
studies is that of empirically deriving scoring weights (Davis and Fifer, 
1959; Echternacht, 1973; Hendrickson, 1971; Reilly and Jackson,
1972; Strong, 1943). If empirically derived scoring weights were to be 
adopted by such large-scale testing programs as the College Entrance 
Examination Board, the Graduate Record Examinations, the Law 
School Admission Test, and other programs, one problem that would
have to be faced is that of determining the variances of the derived
weights and the implications these variances have for developing test
items. This is necessary on repeated occasions, and the scoring weights
would only be developed on the initial administration. Since some ex- 
aminees would not be included in the initial scoring run, the problem 
of scoring weight variance exists. Also, by knowing this variance, the 
minimum number of examinees needed to develop the weights, subject 
to a specified level of precision, can be determined. 

There are a number of methods that can be used for deriving the 
weights. The method that will be discussed here is that used by Echter- 
nacht (1973), which is actually the method used by Reilly and Jackson 
(1972) with no iterations. Briefly, the method consists of assigning the 
average criterion score of those selecting a given option. The criterion 
variable is standardized, so that its mean is zero and variance is one. 
The criterion that is usually used is the score on the remaining items 
that make up the test although this is certainly not a necessary 
criterion. 

Consider a population of N people who will take a given test at one 
point in time. Assume further that a simple random sample of n people
from the population take the test for the purpose of determining scor- 
ing weights. Although this is not exactly true in an operational setting, 
it does provide a useful approximation to reality. Consider one item 
for that test. The scoring weight assigned to the ith option of this item 
is

$$\bar{y}_i = \sum_{j=1}^{n_i} y_{ij} / n_i$$

where $n_i$ represents the number of people responding with the ith
option and $y_{ij}$ represents the criterion score for the jth person choosing
the ith option. In weighting options, the omit category is considered
another option and a weight is also derived. Since the criterion
variable is assumed to be standardized,

$$\sum_{i=1}^{c} n_i \bar{y}_i / n = \bar{y} = 0$$

where

$$n = \sum_{i=1}^{c} n_i,$$
the number of people responding to the item with one of the c possible 
options. Using the standard result for the variance of a mean obtained 
by simple random sampling from a finite population, the variance of 
the ith option weight thus becomes

$$\frac{(1 - n_i / N_i)\, S_i^2}{n_i}$$

where

$$S_i^2 = \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y}_i)^2 / (N_i - 1).$$

$N_i$ indicates the number of examinees in the population responding
with option i. The problem becomes one of estimating $S_i^2$. This is done
by using the unbiased estimate

$$s_i^2 = \frac{1}{n_i - 1} \sum_{j=1}^{n_i} (y_{ij} - \bar{y}_i)^2$$

Such estimates of $S_i^2$ would presumably be obtained through pretesting
of the item.
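
The weight and its estimated variance can be sketched for one item as follows. The responses, criterion scores, and population counts $N_i$ are hypothetical, and the function name is an illustrative assumption rather than part of any operational scoring program.

```python
from statistics import mean, variance

def option_weights(responses, criterion, population_counts):
    """Return {option: (weight, estimated variance of the weight)}.

    responses         -- option chosen by each sampled examinee
    criterion         -- standardized criterion score of each examinee
    population_counts -- hypothetical N_i for each option
    """
    by_option = {}
    for option, y in zip(responses, criterion):
        by_option.setdefault(option, []).append(y)
    result = {}
    for option, ys in by_option.items():
        n_i = len(ys)
        weight = mean(ys)                       # mean criterion score
        s2 = variance(ys)                       # unbiased, divisor n_i - 1
        fpc = 1 - n_i / population_counts[option]
        result[option] = (weight, fpc * s2 / n_i)
    return result

# Hypothetical data; "omit" is treated as one more option.
responses = ["A", "A", "A", "B", "B", "omit", "omit", "B"]
criterion = [1.2, 0.8, 1.0, -0.5, -0.9, -1.1, -0.7, 0.2]
weights = option_weights(responses, criterion,
                         {"A": 300, "B": 300, "omit": 200})
print(weights)
```

Each option needs at least two respondents in the sample, since the unbiased variance estimate is undefined for a single score.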

Suppose the whole population of N examinees is used for the pur- 
pose of determining scoring weights, and the method previously 
described is used. 

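Now,

$$S^2 = \sum_{i=1}^{c} \sum_{j=1}^{N_i} (Y_{ij} - \bar{Y})^2 / (N - 1) = 1 \quad \text{and} \quad \bar{Y} = 0$$

where c indicates the number of response options. From the standard
algebraic identity for the analysis of variance, with

$$N = \sum_{i=1}^{c} N_i,$$

$$(N - 1) S^2 = \sum_{i=1}^{c} N_i \bar{Y}_i^2 + \sum_{i=1}^{c} (N_i - 1) S_i^2 \qquad (1)$$

If the 1/N is negligible (1) may be written as

$$S^2 = \sum_{i=1}^{c} W_i \bar{Y}_i^2 + \sum_{i=1}^{c} W_i S_i^2, \qquad (2)$$

where

$$W_i = N_i / N$$

so that, since $S^2 = 1$,

$$1 - \sum_{i=1}^{c} W_i \bar{Y}_i^2 = \sum_{i=1}^{c} W_i S_i^2 = S_w^2, \qquad (3)$$

which indicates that the $S_i^2$ are not independent for all c categories.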
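
The analysis-of-variance identity invoked here (total sum of squares equals the between-options plus the within-options sum of squares) can be verified numerically. The sketch below uses a small hypothetical population split into three response options.

```python
from statistics import mean, variance

def anova_identity(groups):
    """Return (total SS, between SS + within SS) for a partitioned population."""
    everyone = [y for group in groups for y in group]
    grand = mean(everyone)
    total = (len(everyone) - 1) * variance(everyone)
    between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    within = sum((len(g) - 1) * variance(g) for g in groups)
    return total, between + within

# Hypothetical criterion scores grouped by the option chosen.
groups = [[1.2, 0.8, 1.0], [-0.5, -0.9, 0.2], [-1.1, -0.7]]
total, parts = anova_identity(groups)
print(total, parts)
```

Because the total is fixed, any increase in the between-options term must be matched by a decrease in the within-options term.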

In obtaining empirically derived scoring weights it is, of course,
desirable to have the variance of the resulting weights be of a
minimum. If a large enough pool of examinees is tested in the initial
test administration so that the $n_i$ are all large for each item, the
variances will likely be small. This is not always the case, though, and
it does not tell the item writer anything about how he should write the
items to help insure that a small variance results. The item writer can
have some influence over both the $n_i$ and the $S_i^2$. By increasing the
$n_i$ and decreasing the $S_i^2$, the ith option weight's variance will
decrease. But the $n_i$ and $S_i^2$ are not independent for a given item.
Therefore, it seems reasonable to consider minimizing the pooled
within-option variance $S_w^2 = \sum_i (N_i/N) S_i^2$ and the implications
this minimization has for item writers. One can see that $S_w^2$
can be minimized by making the between options sum of squares,
$\sum_i N_i \bar{Y}_i^2$, a maximum.

Although it is recognized that the following discussion is somewhat
esoteric for the item writer and the conditions presented very
unrealistic, the discussion following is an attempt to demonstrate some of
the basic principles that should be used in minimizing the pooled
within-option variance $S_w^2$. In maximizing $Q = \sum_i N_i \bar{Y}_i^2$, a few
things need to be noted. In the case where c = 2, it can be easily shown
that Q attains a minimum when $\sum_{j=1}^{N_i} Y_{ij} = 0$, or when each
category mean equals the overall mean. Also, if $\sum_{j=1}^{N_i} Y_{ij}$ can be
considered given and Q a function of only the $N_i$'s, Q is minimized
when $N_i = N/2$. Since we are considering a finite population, a
maximum value of Q is obtained when all positive $Y_{ij}$ are
found in one category and all negative $Y_{ij}$ in the other. The zero values
of $Y_{ij}$ are placed in the category with the largest $N_i$.

In cases where c > 2, it can be shown that Q is minimized when
$\sum_{j=1}^{N_i} Y_{ij} = 0$ for each i, or if the sums, $\sum_{j=1}^{N_i} Y_{ij}$, are
considered fixed, when the $N_i$ are proportional to $|\sum_{j=1}^{N_i} Y_{ij}|$.
Maximum values can be obtained only when the criterion values can be
partitioned into nonoverlapping regions, with each region corresponding
to a group of people responding with a particular distractor. In
topological terms these regions are termed "connected" regions, and their
union consists of the entire criterion variable space. This is also the
case where each distractor can be used to place the individual responding
with that distractor in a categorization of the criterion.
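
The role of nonoverlapping regions can be illustrated by brute force for a two-option item. The sketch below enumerates every assignment of a small hypothetical set of criterion scores (overall mean zero) to two options and keeps the one with the largest between-options sum of squares; the function names are illustrative assumptions.

```python
from itertools import combinations

def between_ss(groups):
    """Sum over options of N_i times the squared option mean (overall mean 0)."""
    return sum(len(g) * (sum(g) / len(g)) ** 2 for g in groups)

def best_two_way_split(values):
    """Try every two-option assignment and keep the one with the largest value."""
    n = len(values)
    best_q, best_groups = -1.0, None
    for size in range(1, n):
        for idx in combinations(range(n), size):
            chosen = set(idx)
            g1 = [values[k] for k in range(n) if k in chosen]
            g2 = [values[k] for k in range(n) if k not in chosen]
            q = between_ss([g1, g2])
            if q > best_q + 1e-12:
                best_q, best_groups = q, (g1, g2)
    return best_q, best_groups

criterion = [-1.5, -0.9, -0.4, 0.2, 0.7, 1.9]   # hypothetical scores, mean 0
q_max, best_groups = best_two_way_split(criterion)
low, high = sorted(best_groups, key=min)
interleaved = between_ss([[-1.5, -0.4, 0.7], [-0.9, 0.2, 1.9]])
print(q_max, low, high, interleaved)
```

In this example the best split places every score in the lower group below every score in the upper group, whereas an interleaved assignment gives a much smaller between-options sum of squares.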

In practice though, it is impossible for an item writer to write items
with the property previously noted. The item writer can structure
distractors in such a way that examinees of differing ability levels respond
to different distractors. Such a practice would tend to approximate the
condition mentioned previously, assuming that ability and the
criterion are related, and allow Q to be maximized as much as is
practical. The procedure of "facet design" as set forth by Guttman (see
Elizur, 1970) is one method that might be used to so structure the
distractors. In examining the results of item pretesting, the quantity Q
should also be taken into consideration in making the decision of
whether or not to include a given item as part of a test that will be
scored using empirically derived option weights.


THE APPLICATION OF DISCRIMINANT
FUNCTION ANALYSIS TO CORRELATED SAMPLES


G. FRANK LAWLIS 
Texas Tech University 


ARTHUR B. SWENEY 
Wichita State University 


A method of applying the discriminant function to correlated sam- 
ples was discussed for the researcher interested in utilizing multi- 
variate statistics to pre-post data. By finding the linear combination 
that maximizes differences in the groups, the t-statistic can be com- 
puted for correlated samples. 


For years researchers in the behavioral sciences have been attempt- 
ing to demonstrate the general effectiveness of therapeutic treatments. 
Although there are several designs from which to infer such influences, 
two methods come readily to mind. From one method one can com- 
pare separate groups that have had, or have been assumed to have 
had, differential treatments, i.e., the cross-sectional design. The other 
design involves the comparison of the same group over a series of time
segments, i.e., the longitudinal approach. 

Statistical methods have been devised to determine if treatment
outcomes significantly differ with respect to a dependent variable, i.e.,
analysis of variance for block effects, t-tests for correlated samples,
etc. However, these methods are utilized for only one dependent
variable.

For each variable tested, the probability of finding significance by 
chance alone increases geometrically. Consequently, researchers have 
to make decisions as to what dependent variables appear to be critical 
to their particular model of research. Theory building becomes a
process of inductive reasoning through variable-by-variable testing.
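
The growth in chance findings can be made concrete with a small sketch. It assumes that the k tests are independent and that each is run at the same level; neither assumption is stated in the text, and correlated dependent variables would change the figures.

```python
def familywise_error(alpha, k):
    """Chance of at least one spurious significant result in k independent tests."""
    return 1 - (1 - alpha) ** k

rates = {k: familywise_error(0.05, k) for k in (1, 5, 10, 20)}
for k, rate in rates.items():
    print(k, round(rate, 3))
```

Under these assumptions the chance of at least one spurious result at the .05 level rises to roughly .40 with ten variables and .64 with twenty.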

The measurement of the effectiveness of a treatment utilizing a
combination of variables can be facilitated through the use of discriminant
analysis as the most appropriate statistical method since the procedure 
makes it possible to maximize the group difference by the most effi- 
cient combination of variables. Discriminant analysis is virtually 
always applied to discrete groups, and thus limited to cross-sectional 
studies. A problem arises when a researcher wishes to show change of 
one group over time with a combination of variables. In other words, 
the discriminant function has not been applied to longitudinal 
research. The purpose of this paper is to consider an application of the 
discriminant analysis to correlated or identical groups measured 
through time. 


Method 


Before the proposed technique is discussed, a simple explanation of 
the principle of the discriminant function should be presented. Con- 
sider the simplest case in which a researcher wished to combine two 
variables such as ego strength (A) and motivation (B) to discriminate 
between two discrete groups, successful therapy and nonsuccessful 
therapy cases, as determined by the researcher. If the researcher can 
use the scales as a coordinate system for a two-dimension model, those 
points can be represented in two-dimensional space. 

As seen in Figure 1, those points can be transformed to another 
linear scale C in which the group differences are maximized. That 
linear scale C will pass through the origin (0, 0) and the coordinates 
that would be the respective weights of the variables. In our example, 
the weights that maximized group difference were 2.00 A (ego 
strength score) and — 1.00 B (motivation). Each value can be weighted, 
summed and represented on the vector as a single point. 

The strategy is to determine the eigenvector that will provide the 
greatest separation between the two groups such that the between 
variance is maximized.


$$\text{Maximum variance} = \frac{\text{Sum of Squares between}}{\text{Sum of Squares within}}$$


There can be an n-dimensional space with n variables, and
there is no room here to discuss the calculus of determining the respective
weights in which orthogonal relationship could be assumed. For those
researchers interested in these procedures, please refer to Tatsuoka
(1970) for a more sophisticated and thorough explanation of these
more complex computations.
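
For the two-group, two-variable case the weights can be sketched directly. The code below assumes the classical two-group solution, in which the weights are proportional to the inverse of the pooled within-groups matrix times the difference in group means; the scores and function names are hypothetical.

```python
def mean(xs):
    return sum(xs) / len(xs)

def fisher_weights(group1, group2):
    """Weights proportional to W^-1 (m1 - m2) for two variables."""
    m1 = [mean([row[k] for row in group1]) for k in (0, 1)]
    m2 = [mean([row[k] for row in group2]) for k in (0, 1)]
    # Pooled within-groups sums of squares and cross-products.
    w = [[0.0, 0.0], [0.0, 0.0]]
    for group, m in ((group1, m1), (group2, m2)):
        for row in group:
            d = (row[0] - m[0], row[1] - m[1])
            for a in (0, 1):
                for b in (0, 1):
                    w[a][b] += d[a] * d[b]
    det = w[0][0] * w[1][1] - w[0][1] * w[1][0]
    diff = (m1[0] - m2[0], m1[1] - m2[1])
    return ((w[1][1] * diff[0] - w[0][1] * diff[1]) / det,
            (w[0][0] * diff[1] - w[1][0] * diff[0]) / det)

def ss_ratio(weights, group1, group2):
    """Between-groups / within-groups sum of squares of the composite score."""
    s1 = [weights[0] * a + weights[1] * b for a, b in group1]
    s2 = [weights[0] * a + weights[1] * b for a, b in group2]
    grand = mean(s1 + s2)
    between = (len(s1) * (mean(s1) - grand) ** 2
               + len(s2) * (mean(s2) - grand) ** 2)
    within = (sum((s - mean(s1)) ** 2 for s in s1)
              + sum((s - mean(s2)) ** 2 for s in s2))
    return between / within

# Hypothetical (ego strength, motivation) scores.
successful = [(6, 3), (7, 4), (8, 4), (7, 6), (9, 5)]
nonsuccessful = [(4, 5), (5, 6), (3, 4), (5, 7), (4, 6)]
weights = fisher_weights(successful, nonsuccessful)
print(weights, ss_ratio(weights, successful, nonsuccessful))
```

Only the direction of the weights matters; multiplying both by a constant leaves the ratio of sums of squares unchanged.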


The primary consideration is that there is a possibility of
representing the combination of variables in linear form maximizing group
differences. Therefore, the final variable scores have the same property
of occurring randomly as any of their sub-parts. As such, it can be
used in reference to the t-distribution.

[Figure 1. Data of two groups represented in two-dimensional space. Axes: Ego Strength and Motivation; groups: successful clients and nonsuccessful clients.]

The major problem in applying such a technique to dependent
samples is the assumption of interdependence between groups. That is, in
most cases there is a positive correlation between pre- and post-scores.
Therefore, one must make them statistically independent by
subtracting the variance attributable to pre-test score from the post-score as
follows:

$$z'_{post} = z_{post} - r_{pp} z_{pre}$$

$z'_{post}$ = partialled score on post-test after variance explainable by
pre-test is removed

$z_{post}$ = standard scores on post-test

$z_{pre}$ = standard scores on pre-test

$r_{pp}$ = product-moment correlation between pre- and post-scores

It is then a simple matter to apply the t-test for correlated samples to
that linear combination vector (Ferguson, 1959) in showing changes 
over time for a combination of variables. Moreover, by the value of 
the maximizing weights, a researcher could determine the most rel- 
evant variables to that change. 
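
The partialling step can be sketched as follows. The pre- and post-scores are hypothetical, and the standard scores are computed with the N divisor so that the product-moment correlation is simply the mean cross-product.

```python
from math import sqrt

def standardize(xs):
    """Standard scores using the N divisor, so that mean(z * z) equals 1."""
    m = sum(xs) / len(xs)
    sd = sqrt(sum((x - m) ** 2 for x in xs) / len(xs))
    return [(x - m) / sd for x in xs]

def partial_post(pre, post):
    """Return (z_post - r * z_pre, z_pre, r)."""
    z_pre, z_post = standardize(pre), standardize(post)
    r = sum(a * b for a, b in zip(z_pre, z_post)) / len(pre)
    partialled = [b - r * a for a, b in zip(z_pre, z_post)]
    return partialled, z_pre, r

pre = [4, 6, 5, 7, 3, 5]      # hypothetical pre-test scores
post = [5, 8, 6, 9, 5, 6]     # hypothetical post-test scores
partialled, z_pre, r = partial_post(pre, post)
leftover = sum(a * b for a, b in zip(partialled, z_pre)) / len(pre)
print(r, leftover)
```

The partialled post-scores are uncorrelated with the pre-test standard scores, which is the property the subtraction is meant to secure.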


Discussion and Implications 


Psychological research has frequently been directed toward deter- 
mining the effects of therapeutic intervention, yet change is difficult to 
attribute to any one variable, whether it is depression, disruptive 
behavior, or the disability. Moreover, error rate prohibits the 
statistical testing of a wide range of important variables analyzing each 
one by one. With the utilization of a combination of variables or in- 
dices of variables, such as behavior ratings, test scores, etc., the 
researcher may more easily demonstrate an outcome. As an illustration,
consider that the researcher found that the combination of ego
strength (A) and motivation scores (B), with the weights of 2.0 and
-1.0, respectively, maximized the difference between the pre-test and
post-test scores significantly (see Table 1). Finding a significant
change from pre- to post-testing, he could make the inference that his
treatment appeared to be effective in making change for this
combination of variables, ego strength being the more relevant.


TABLE 1
Illustration Problem

      Pre-Score* ($z_{pre}$)                    Post-Score* ($z'_{post} = z_{post} - r_{pp} z_{pre}$)
       Ego                                      Ego
Ss   Strength  Motivation  Combination**     Strength  Motivation  Combination**
 1      5          4            6               5          4            6
 2      4          5            3               6          8            4
 3      5          4            6               7          6            8
 4      3          3            3               8          4           12
 5      2          4            0               5          4            6
 6      5          5            5               6          5            7
 7      3          5            1               7         10            4
 8      4          6            2               6          8            4
 9      6          6            6               8          8            8
10      5          5            5               6          5            7

* Stanines (mean = 5, S.D. = 2)
** Weights: Ego Strength = 2, Motivation = -1

$$t = \frac{\bar{D}}{\sqrt{\dfrac{\sum D^2 - (\sum D)^2 / N}{N(N - 1)}}} = \frac{-2.9}{\sqrt{\dfrac{147 - (-29)^2 / 10}{(10)(9)}}}$$
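
The illustration in Table 1 can be worked through with a short sketch. It assumes the difference-score form of the t-test for correlated samples and takes the combination scores as 2 x ego strength minus motivation; the function name is illustrative.

```python
from math import sqrt

WEIGHTS = (2, -1)   # ego strength, motivation

pre_combination = [6, 3, 6, 3, 0, 5, 1, 2, 6, 5]
post_scores = [(5, 4), (6, 8), (7, 6), (8, 4), (5, 4),
               (6, 5), (7, 10), (6, 8), (8, 8), (6, 5)]
post_combination = [WEIGHTS[0] * ego + WEIGHTS[1] * mot
                    for ego, mot in post_scores]

def correlated_t(before, after):
    """Return (sum of D, sum of D squared, t) with D = before - after."""
    d = [b - a for b, a in zip(before, after)]
    n = len(d)
    sum_d, sum_d2 = sum(d), sum(x * x for x in d)
    se = sqrt((sum_d2 - sum_d ** 2 / n) / (n * (n - 1)))
    return sum_d, sum_d2, (sum_d / n) / se

sum_d, sum_d2, t = correlated_t(pre_combination, post_combination)
print(sum_d, sum_d2, round(t, 2))
```

The sums reproduce the mean difference of -2.9 and the sum of squared differences of 147 that appear under the table, and t comes to about -3.47 on 9 degrees of freedom.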

Perhaps this application of the discriminant analysis is overly 
simplistic, but the belief that change occurs in only one variable at a 
time is a more simplistic concept which has probably drastically 
limited complex theory building in psychology. Obviously, replication 
of findings using identical weights would show validation of the 
results, but trends could be determined by statistical confirmation. At 
least a method of deductive hypothesis testing could be realized. 


A COMPARISON OF VARIABLE CONFIGURATIONS
ACROSS SCALE LENGTHS:
AN EMPIRICAL STUDY¹


HOWARD G. SCHUTZ AND MARGARET H. RUCKER
University of California, Davis 


Data from 2-, 3-, 6-, and 7-point rating scales were analyzed 
to determine whether scale length affected response patterns. The 
results of this study indicate that data configurations are rela- 
tively invariant with changes in number of scale points. 


IN educational and psychological research, rating scales are often 
used to collect data. An important question in constructing such scales 
is how many response categories to provide. This problem has been 
approached in several ways. Bendig (1954) and Komorita and Graham 
(1965) studied the reliability of information obtained using scales of 
different lengths. Their research indicates that reliability, at least of the
sum of a set of homogeneous rating scales, is independent of number
of response categories. Research by Matell and Jacoby (1971)
supported these results and also revealed that both stability and validity of
cumulative scores from Likert-type items are independent of the
number of scale points. Finn (1972), however, found that for ratings
on a single factor, ratings on a 9-point scale were less reliable than
ratings on shorter scales. Other aspects of the problem studied by Matell
and Jacoby (1972) include the effects of testing time and scale
properties. This research revealed that proportion of scale used was
independent of number of scale points (excluding 2- and 3-point scales) but
mean testing time increased and usage of an "uncertain" category
decreased as number of scale steps increased. Green and Rao (1970),
using numerical simulation, investigated the effect of scale length on
ability to recover the original data configuration. Their research
indicates that one should use at least six response categories; response
degradation to 3-point or 2-point scales results in poor recovery of the
original configuration. The present study was designed to extend these
results by comparing factors produced from empirical data.

¹ The authors wish to thank Professor G. F. Russell for his development of the
computer program used to calculate means, cross products, and similarity statistics
used in this research.


Research Design 


As part of an ongoing food attitude research project, the authors 
developed four forms of a food-use questionnaire to investigate the 
effects of scale length on response patterns. These questionnaires were 
identical except for number of scale points. The questionnaires called 
for rating the appropriateness of ten different foods in ten different 
situations. The ten foods were utilized as the variables in this study. 
An example of these questionnaires is given in Figure 1. 

It was decided to include 2-, 3-, 6-, and 7-point scales in this study. 
These numbers were chosen to make it possible to compare short and 
the more commonly used longer scales, with and without midpoints. 
All four scales were anchored at the ends with the terms "appropriate"
and "inappropriate." 

Subjects were 60 male and 60 female students enrolled in a history 
course at a large western university. Fifteen male and fifteen female 
students were randomly assigned to complete each of the four forms of 
the questionnaire. 


Analyses and Results 


For each group, a mean rating for each food-use combination was 
computed. Raw cross-products computed from the mean ratings were 
factor analyzed to produce clusters, as suggested by Nunnally (1967, p.
381). The Biomedical computer program used for these analyses per- 
forms a principal component solution and an orthogonal rotation of 
the factor matrix. 

For each of the scales, three factors accounted for over 97% of the 
variance: .993 for the 2-point, .988 for the 3-point, .982 for the 6-point, 
and .979 for the 7-point. 

The factor loadings were then examined for differences between the 
scales. Since the size of cross-product factor loadings varies with scale 
size, the loadings were converted to proportions to facilitate this
inspection. These loadings are presented in Table 1.

To obtain a measure of variability of the factor loadings that was 
comparable across scales, the standard deviations of the factor load- 
ings were converted to coefficients of variation. The resulting figures, 
shown in Table 2, indicate some increase in variability as number of
response categories increases. 

Since the study involved fixed variables and different samples,
Tucker's (1951) coefficient of congruence was used to determine the
extent of agreement between corresponding factor weights (Harman,
1967). The only discernible trend in the results of these analyses was
that the correlation between the 2-point and 7-point scale factor
loadings was slightly lower than the other correlations, for all three
factors. However, the coefficients were all so uniformly high, .98 or
higher for all comparisons, that the importance of this trend is
questionable.


Food Use Questionnaire

We are asking you to complete the following food use questionnaire to help us learn
about consumers' attitudes toward foods. Specifically, we are interested in
determining how consumers feel about the appropriateness of various foods in different
situations. Please use the following scale to fill in the grid at the bottom of the
page. For each food-use combination, select the nearest whole number that best
represents how you feel about the appropriateness of the combination and write it
in that cell.

appropriate                                        inappropriate
     1       2       3       4       5       6       7

Example

If you feel, for instance, that fruit (first food in this example) is appropriate
for lunch (first use), you would put a "1" in that cell. If you feel that root beer
(second food) is inappropriate for lunch, you would put a "7" in that cell. Of
course, in completing the grid at the bottom of the page, you should use whichever
number most closely represents your opinion of the degree of appropriateness of a
particular combination.

[Example grid. Foods: 1. fruit, 2. root beer. Uses: 1. lunch, 2. with nuts.]

Please fill in the grid working down the columns (as has been done in the example).
Preliminary research indicates that filling in the cells in each column before going
on to the next one is faster than working across the rows. You may not be familiar
with some of the foods or have engaged in some of the uses or food-use combinations.
Even if this is the case, for each food-use combination, please give us your opinion
of how appropriate it is to use this food in this situation. Do not leave any cells
blank. Since we are interested in your opinion regarding appropriateness, we would
appreciate it if you would not compare your responses with those of other people
until after you have completed the grid.

[Response grid: ten uses (columns) by ten foods (rows).]

 1. jello
 2. potato chips
 3. chicken
 4. orange juice
 5. celery
 6. soup
 7. pizza
 8. cereal
 9. pie
10. grapes

Figure 1.


TABLE 2
Coefficients of Variation of Cross-Product Factor Loadings

                     Factor 1   Factor 2   Factor 3
Two-point scale       .1624      .1665      .1171
Three-point scale     .2244      .2260      .1878
Six-point scale       .2773      .2710      .2300
Seven-point scale     .2984      .3387      .2405
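
A minimal sketch of the coefficient of congruence follows, using the usual definition as the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. The loadings shown are hypothetical, not those of the study.

```python
from math import sqrt

def congruence(x, y):
    """Coefficient of congruence between two vectors of factor loadings."""
    cross = sum(a * b for a, b in zip(x, y))
    return cross / sqrt(sum(a * a for a in x) * sum(b * b for b in y))

# Hypothetical loadings of the same variables on a factor under two scales.
two_point = [0.12, 0.09, 0.11, 0.08, 0.10]
seven_point = [0.14, 0.07, 0.12, 0.06, 0.11]
phi = congruence(two_point, seven_point)
print(round(phi, 3))
```

Because no means are subtracted, the coefficient is unaffected by multiplying either set of loadings by a positive constant, which suits loadings that differ in size across scale lengths.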

A measure of the distance between profiles, D, was computed 
between means for all pairs of foods (Nunnally, 1967). The D values for
each scale were divided by the maximum possible D for that scale, and 
these numbers were then subtracted from 1 so that the final values 
would range from 0 to 1 with 1 being identity. 
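
The rescaling of D can be sketched as follows. Two assumptions are made that the text does not state: D is the Euclidean distance between the two profiles of mean ratings, and the maximum possible D occurs when the profiles sit at opposite ends of the scale in every situation.

```python
from math import sqrt

def similarity(profile_a, profile_b, scale_points):
    """1 - D / D_max for two profiles of mean ratings on a 1..k scale."""
    d = sqrt(sum((a - b) ** 2 for a, b in zip(profile_a, profile_b)))
    # Assumed maximum: ratings of 1 versus k in every situation.
    d_max = sqrt(len(profile_a)) * (scale_points - 1)
    return 1 - d / d_max

# Hypothetical mean ratings of two foods across ten situations (7-point form).
food_a = [1.5, 2.0, 6.1, 3.3, 4.0, 2.2, 5.5, 1.9, 3.0, 6.4]
food_b = [1.8, 2.4, 5.7, 3.9, 4.4, 2.0, 5.1, 2.5, 3.2, 6.0]
value = similarity(food_a, food_b, 7)
print(round(value, 3))
```

Dividing by the scale-specific maximum is what makes the similarities comparable across the 2-, 3-, 6-, and 7-point forms.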

A Kruskal (1964) nonmetric multidimensional analysis was com- 
puted on the similarity data from each group. On the basis of the fac- 
tor analyses results, three dimensions were selected to be fitted in the 
Kruskal analyses. The resulting stress values were .042 for the 2-point,
.028 for the 3-point, .018 for the 6-point, and .018 for the 7-point scale.
There is some tendency for stress to decrease as number of response
categories increases through 6 points. However, there is no difference
between the 6- and 7-point stress values and all of the stress values are
low.

Responses for each scale were then subdivided into male and female 
groups. Analyses of these subgroups produced no readily evident 
trends, either between males and females or across scales. However, 
different results may be produced with larger sample sizes. It appears 
that one does not get a stable factor structure with only 15 cases. 


Conclusion 


Within the limits of this study, it is concluded that number of 
available response categories, at least within the 2- to 7-point range, 
does not materially affect the cognitive structure derived from 
responses to that scale.


REFERENCES 


Bendig, A. W. Reliability and the number of rating scale categories. 
Journal of Applied Psychology, 1954, 38, 38-40. 




Finn, R. H. Effects of some variations in rating scale characteristics on 
the means and reliabilities of ratings. EDUCATIONAL AND 
PSYCHOLOGICAL MEASUREMENT, 1972, 32, 255-265.

Green, P. E. and Rao, V. R. Rating scales and information recovery— 
How many scales and response categories to use? Journal of 
Marketing, 1970, 34, 33-39. 

Harman, H. H. Modern factor analysis (2nd ed.). Chicago: University
of Chicago Press, 1967. 

Komorita, S. S. and Graham, W. K. Number of scale points and the 
reliability of scales. EDUCATIONAL AND PSYCHOLOGICAL 
MEASUREMENT, 1965, 25, 987-995. 

Kruskal, J. B. Nonmetric multidimensional scaling: A numerical 
method. Psychometrika, 1964, 29, 115-129. 

Matell, M. S. and Jacoby, J. Is there an optimal number of alternatives 
for Likert scale items? Study I: Reliability and validity. 
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT, 1971, 31,

Matell, M. S. and Jacoby, J. Is there an optimal number of alternatives 
for Likert-scale items? Effects of testing time and scale properties. 
Journal of Applied Psychology, 1972, 56, 506-509. 

Nunnally, J. C. Psychometric theory. New York: McGraw-Hill, 1967. 

Tucker, L. R. A method for synthesis of factor analysis studies. Per- 
sonnel Research Section report, No. 984. Washington, D. C.: 
Department of the Army, 1951. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 325-339. 


AN INVESTIGATION OF THE RASCH SIMPLE
LOGISTIC MODEL: SAMPLE FREE ITEM AND TEST
CALIBRATION¹


HOWARD E. A. TINSLEY² AND RENÉ V. DAWIS
University of Minnesota 


This research investigated the use of the Rasch simple logistic 
model in item and test calibration. Tests employing word, picture, 
symbol, and number analogies were administered to high school 
students, college students, civil service clerical employees, and 
clients of the Minnesota Division of Vocational Rehabilitation. The 
results indicated that Rasch item easiness ratios and z item difficulty 
ratios were invariant with respect to the ability of the calibrating 
sample when an adequate sample was employed and the test design 
did not incorporate biasing factors. The invariance of the Rasch item 
easiness ratios was shown to be related to the goodness-of-fit of the 
items to the Rasch model in that the deletion of items with low Rasch 
probabilities increased the invariance of the Rasch item easiness 
ratios. The estimation of the amount of ability indicated by the raw 
scores on a test was also shown to be invariant with respect to the
ability of the calibrating sample for tests of 25 or more items, even 
when samples of fewer than 100 subjects were studied. 


OVER 20 years ago Gulliksen (1950) remarked that the discovery of
item parameters which would remain stable as the item analysis
group changed would constitute a significant contribution to item
analysis theory. More recently Lord and Novick (1968) have expressed
a similar opinion. Within the framework of classical test theory a
number of indices of item difficulty have been suggested which might
possess this property. A normal curve transformation of p values to z
values, frequently referred to as Thurstone's method of absolute
scaling, has been suggested by several authors (Bliss, 1929; Guilford, 1954;
Horst, 1933; Thorndike, Bergman, Cobb, and Woodyard, 1926; Thur- 
stone, 1925, 1947). A second method commonly suggested for obtain- 
ing invariant item difficulty parameters, the limen method, has been 
described by Bliss (1929), Thorndike et al. (1926), and Tucker (1953). 
Modifications of the limen method have been discussed by Gulliksen 
(1950) and Richardson (1936). Both the method of absolute scaling 
and the limen method require the assumption of a normal distribution 
for the ability under consideration. Although both methods were first 
described 50 years ago, neither apparently has been investigated 
systematically. 

More recently, Rasch has introduced a “latent trait" model which 
purportedly makes possible sample-free item and test calibration 
(Rasch, 1960, 1961, 1966a, 1966b). A major advantage of the model is
its “objectivity,” i.e., the model allows the computation of item and
test parameters from any sample of subjects since the estimation of the 
parameters is independent of the distribution of ability in the calibrat- 
ing sample. Schmidt (1970) has presented a proof that the Rasch 
model is the only model to produce objectivity. The purpose of this 
study was to investigate the objectivity of the Rasch model in item and 
test calibration. 

The Rasch model makes the following assumptions (Anderson, 
Kearney, and Everett, 1968; Brooks, 1965; Sitgreaves, 1963): 

1. Items are scored dichotomously, 
2. Speed does not influence the probability of a correct response, 
3. Given the parameters for item easiness (e) and subject ability 
(a), all responses on a test are stochastically independent, and
4. The probability of a correct response by individual i to item j is
a function of the ratio $a_i/e_j$.
This last assumption excludes guessing and variations in item dis- 
crimination as factors which affect the probability of a correct - 
response. The effects of violating this assumption have been studied by 
Brink (1971) and Panchapakesan (1969). 
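
The sense in which such a model is sample-free can be sketched numerically. The code below uses the logit form of the Rasch model that is common in later literature, with a person parameter theta and an item difficulty b on a log scale; this parameterization, the parameter values, and the function names are assumptions made for illustration and are not the notation of the studies cited.

```python
from math import exp

def p_correct(theta, b):
    """Probability of a correct response for person theta on an item of difficulty b."""
    return 1.0 / (1.0 + exp(-(theta - b)))

def odds(p):
    return p / (1.0 - p)

b_easy, b_hard = -0.5, 1.0        # hypothetical item difficulties
ratios = []
for theta in (-1.0, 0.0, 2.0):    # hypothetical persons of differing ability
    ratio = odds(p_correct(theta, b_easy)) / odds(p_correct(theta, b_hard))
    ratios.append(ratio)
print(ratios)
```

The ratio of the odds of success on the two items is the same for every person, so a comparison of the items does not depend on the ability level of the people used to make it.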

Only three investigations of item calibration using the Rasch model
have been reported in the literature. Rasch (1960) used data from four
subtests of the Danish Military Group Intelligence Test BPP which was
given to 1094 Danish military recruits. He found the data fit his model
for subtests N (a test of finding the next term in a numerical sequence)
and L (a test similar to Raven's Progressive Matrices, but with groups
of letters instead of geometric figures). The model was inadequate to
explain performance on subtests F (a test in which geometric shapes
are to be decomposed into parts) and V (a test of verbal analogies).
Rasch had used restrictive time limits with subtests F and V, however.
When the time factor was controlled, the data for these subtests also fit
his model (Rasch, 1966a).


Brooks (1965) wanted to determine if intelligence test data obtained 
from American public school children would fit the Rasch model. 
Samples of eighth graders and tenth graders in Iowa Public Schools 
(part of the standardization sample for the 1964 Lorge-Thorndike 
Intelligence Test) were employed in this study. Of the 243 items tested, 
177 (72.8%) fit the Rasch model. Brooks (1965) also investigated the 
invariance of item easiness ratios and concluded that Rasch item 
easiness ratios are invariant with respect to the ability of the calibrat- 
ing sample. 


Anderson et al. (1968) investigated the hypotheses that Rasch item 
easiness estimates are independent of the ability of the calibrating sam- 
ple, and that Rasch item easiness estimates are more stable when only 
items which fit the Rasch model are considered. The test used was the
45-item spiral omnibus intelligence test for screening applicants to the 
Australian Army or Royal Australian Navy. Samples of 608 recruit 
applicants to the Citizen Military Force (CMF), and 874 recruit appli- 
cants to the Royal Australian Navy (RAN) were studied. Twelve items 
were deleted for zero or 100% correct responses. For the CMF sample 
30 items (91%) fit the Rasch model at the .01 level of confidence, and 
25 items (76%) fit the Rasch model at the more stringent .05 level of 
confidence. (The level of confidence represents the probability of 
obtaining the observed pattern of responses, assuming the Rasch 
model is adequate to explain performance on the item.) For the RAN 
sample the corresponding findings were 22 items (67%) and 16 items
(48%). The authors computed the product-moment correlation
between the item easiness estimates obtained from the CMF and RAN
samples. The authors concluded from the correlation of .958 (based on
33 items) that the item easiness ratios were independent of the ability
of the samples upon which they were computed. When those items
that failed to fit the Rasch model at the .05 level were deleted, the
correlation increased to .990.

Calibrating a test using the Rasch model results in a logarithmic
ability estimate being assigned to every possible raw score from 1 to
K - 1 (K = number of items). This estimate indicates the amount of ability
required to achieve that raw score. A comparison of the ability
estimates assigned to a given raw score by two samples of different
ability should indicate the degree to which the Rasch model calibrates
a test independently of the ability level of the calibration sample.
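The assignment of an ability estimate to each raw score can be sketched as follows. This is a generic maximum-likelihood conversion solved by Newton's method, assuming item difficulties already expressed in logits; it is not claimed to be the procedure of any program cited in this paper, and the item values are hypothetical.

```python
import math


def expected_score(theta, difficulties):
    """Expected raw score at ability theta (sum of item success probabilities)."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)


def ability_for_raw_score(raw_score, difficulties, tol=1e-10, max_iter=100):
    """Solve expected_score(theta) == raw_score for theta by Newton's method."""
    k = len(difficulties)
    if not 0 < raw_score < k:
        # Scores of 0 and K have no finite estimate, hence the range 1..K-1.
        raise ValueError("ability is finite only for raw scores 1..K-1")
    theta = math.log(raw_score / (k - raw_score))  # starting value
    for _ in range(max_iter):
        probs = [1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties]
        residual = raw_score - sum(probs)
        information = sum(p * (1.0 - p) for p in probs)
        step = residual / information
        theta += step
        if abs(step) < tol:
            break
    return theta


# Hypothetical difficulty logits for a five-item test.
difficulties = [-1.0, -0.5, 0.0, 0.5, 1.0]
abilities = {r: ability_for_raw_score(r, difficulties)
             for r in range(1, len(difficulties))}
```

In this sketch the ability attached to a raw score depends only on the item parameters, so two calibrating samples that yield similar item estimates will also yield similar score-to-ability tables.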

Wright (1967) reports one such investigation based on the responses 
of 976 beginning law students to 48 reading comprehension items on 
the Law School Admission Test. To obtain samples of different ability 
Wright selected two contrasting groups from his total sample. The
"dumb group" included the 325 students who did poorest on the test,
with a top score of 23. The "smart group" included the 303 students 
with the highest scores, their lowest score being 33. Wright compared 
the similarity between the two sets of Rasch ability estimates and the 
two sets of percentile ranks. He concluded that the Rasch model leads 
to sample-free test calibration while the "traditional" method does 
not. 

Anderson et al. (1968) studied the invariance of Rasch ability 
estimates. They correlated the ability estimates obtained from the
CMF and RAN samples. The resulting product-moment correla- 
tion of .992 was interpreted as evidence that the ability estimate as- 
signed to a score on a test is independent of the distribution of ability 
in the calibrating sample. However, it is doubtful that the two samples 
actually differed in ability. 

This paper examines the application of the Rasch model to analogy 
test items. The following hypotheses were investigated: 

1. Rasch item easiness ratios are invariant with respect to the
ability level of the calibrating sample.

2. The higher the probabilities that the individual items fit the
Rasch model, the more invariant the item easiness ratios are
with respect to the ability level of the calibrating sample.

3. Rasch ability estimates, assigned in the calibration of a test,
are invariant with respect to the ability level of the calibrating
sample.

To provide a base line against which the invariance of the Rasch item 
easiness ratios can be compared, a conventional item easiness pa-
rameter (the z item difficulty index) was also calculated and submit-
ted to similar tests.


Method 


Selection of Item Format 


Spearman’s “g” or general mental ability seems to be represented in 
almost all the major intelligence tests in use today. Helmstadter (1964) 
points out that tests dealing with abstract relationships (such as verbal, 
numerical, or symbolic analogies) come closest to representing what is
meant by “g.” For this reason, the analogy format was selected for
study in this research. Guilford (1959) suggests that there are several 
different methods of asking analogy questions, i.e. figurally, sym- 
bolically, semantically, and behaviorally, depending upon the type of 
material used to present the question. To make the present study as 
general as possible, it was decided to study figural (picture), symbolic 
(number and symbol), and semantic (word) test items. Two types of 
symbolic material were used because of the intrinsic differences in the 
two and because Guilford (1966) has reported the discovery of more 
than one factor in some cells in his Structure-of-Intellect. 


Subjects 


Data were obtained from four samples of subjects: college students 
enrolled in an introductory psychology class at the University of Min- 
nesota; high school students enrolled in two suburban Twin Cities 
high schools; civil service clerical employees of the city of Min- 
neapolis; and clients of the Minnesota State Division of Vocational 
Rehabilitation (DVR). 

The samples were similar in terms of race, religion, and sex. The 
high school and college students were younger than the DVR clients 
and civil service employees, had fewer marital obligations, were better 
educated, came from homes which had higher family incomes, had 
better educated mothers, and had fathers employed in higher level
occupations. In comparison with the high school and college students,
the civil service employees were older, had lower family incomes, and
were far more likely to be married and have children. The DVR clients
constituted the most heterogeneous sample in many respects but were
less well educated and had lower family incomes than did the high
school and college students.


Instruments 


Four tests were used with the college and high school students: a 60-
item word analogy test, a 60-item number analogy test, a 50-item pic-
ture analogy test, and a 40-item symbol analogy test. (For a discus-
sion of the test construction process, see Tinsley, 1972.) A 25-item
word analogy test was used with the DVR clients, while a 30-item
picture and a 30-item word analogy test were administered to the
Minneapolis civil service employees. (These word and picture analo-
gies had been selected in an unusual manner. The picture items were
selected from picture items surviving an iterative item analysis
procedure [see Tinsley, 1972]. The word analogies were then con-
structed from the picture analogies by substituting the word for the
object in the picture.)

Item responses were scored and submitted to analysis, using a com- 
puter program written by Wright and Panchapakesan (1969, 1970) and 
modified by Bart, Lele, and Rosse (1970). 

The first question of interest was whether the use of the Rasch 
model leads to item easiness ratios that are invariant with respect to 
the ability of the calibrating sample. Ten “two-sample” comparisons 
were made in this study (see Table 1). In each case a set of analogy 
items was completed by two samples differing in ability. The two sets 
of data were then independently submitted to item analysis. The 
product-moment correlation was calculated between the two sets of 
Rasch item easiness estimates and, for comparison purposes, between 
the two sets of z item difficulty estimates. 
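The two-sample comparison can be illustrated with a short sketch. It assumes that the z item difficulty index is the normal deviate corresponding to the proportion of a sample answering the item correctly, with the sign chosen so that harder items receive larger values; that definition, the function names, and the proportions are assumptions for illustration only.

```python
import math
from statistics import NormalDist


def z_item_difficulty(proportion_correct):
    """Normal deviate for an item (assumed convention: harder item, larger z)."""
    return NormalDist().inv_cdf(1.0 - proportion_correct)


def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical proportions correct for the same five items in two samples
# of different ability (the second sample finds every item harder).
p_sample_1 = [0.90, 0.75, 0.60, 0.45, 0.30]
p_sample_2 = [0.80, 0.60, 0.45, 0.30, 0.15]

z_sample_1 = [z_item_difficulty(p) for p in p_sample_1]
z_sample_2 = [z_item_difficulty(p) for p in p_sample_2]
r_between_samples = pearson_r(z_sample_1, z_sample_2)
```

A correlation near 1.0 indicates that the two samples order and space the items in nearly the same way, even though the second sample's values are uniformly shifted toward greater difficulty.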

The relationship between the “goodness-of-fit” of the item and its 
invariance was also studied. The Rasch item easiness estimates ob- 
tained for the two samples were correlated, first for all items, then for 
the remaining items after first eliminating those items that failed to fit 
the Rasch model for both samples at, respectively, the .01, .05, .10, .25, 
.30, .35, and .40 levels of confidence. A similar procedure was 
employed in investigating the relationship between the invariance of 
the z item difficulty estimate and the “goodness-of-fit” of the P value. 
The criterion levels used for this index were .20 < P < .80, .30 < P <
.70, and .40 < P < .60. In both cases, the hypothesis was that the
between-sample correlation would increase as the criterion became 
more stringent. 

Finally, the invariance of the ability estimates computed for each
raw score was investigated in each of the ten comparisons by com-
puting the product-moment correlation between the two sets of in-
dependently obtained ability estimates.

TABLE 1
Comparisons Made in Testing Invariance of Rasch Item Easiness Ratios

Comparison                                                  Analogy Items
Code Number   Sample 1       N     Sample 2        N     Type      Number
I             College        630   High School     319   Word      60
II            College        630   DVR Clients      89   Word      25
III           High School    319   DVR Clients      89   Word      25
IV            College        276   Civil Service   269   Word      30
V             College        492   High School     120   Picture   50
VI            College        492   Civil Service   269   Picture   25
VII           High School    120   Civil Service   269   Picture   25
VIII          College        276   Civil Service   269   Picture   30
IX            College        492   High School     145   Number    60
X             College        630   High School     308   Symbol    40


Results 


Item Calibration 


Ten sets of data were relevant to an investigation of the invariance
of Rasch item easiness and z item difficulty ratios. Tables 2 and 3 show
the results of analysis of these data. For all items, in all but one com-
parison, the correlation between independent estimates of Rasch item
easiness differed by no more than one point from the correlation between
independent estimates of the z item difficulty index.

Four tests of the invariance of these parameters were performed 
with word analogies. The Rasch item easiness estimates obtained from 
college students correlated highly with those obtained from high 
school students (.95, comparison I) and civil service employees (.91,
comparison IV). At the other extreme, the Rasch item easiness es-
timates obtained from DVR clients had near zero correlations with
those obtained from college and high school students (comparisons II
and III).

Four tests of the invariance of the item parameters were also con- 
ducted with picture analogies. The Rasch item easiness estimates ob- 
tained on the 50-item and 30-item picture analogy tests showed high 
correlations (comparisons V and VIII), while those obtained on the
25-item test showed low correlations (comparisons VI and VII). 

Item parameters obtained on the 60-item number analogy and the 
40-item symbol analogy tests yielded high correlations (comparisons 
IX and X).

The above results indicate the degree to which the item parameters 
are invariant when the analysis is performed on all items in the test. 
The Rasch model, however, cannot be expected to hold for items 
which do not fit the model. For this reason the relationship between
the invariance of the item parameters and the “goodness-of-fit” of the
item was investigated.

Elimination of items which did not fit the Rasch model resulted in
some increase in the correlation between Rasch item easiness es-
timates, but the results did not follow a simple pattern. Only com-
parison VIII (between college students and civil service clerical
employees on 30 picture analogies) showed a steady decrease in in-
variance as items with lower Rasch probabilities were removed. In
contrast, comparison VII (between high school students and civil ser-
vice employees on 25 picture analogies) showed an initial increase in
invariance when those items with Rasch probabilities below .01 were
removed, but when those below .05 were removed, the correlation fell
to near-zero, and fluctuated randomly with subsequent deletions of
items. In comparison IX (between college and high school students on
60 number analogies) the correlation increased when items with Rasch
probabilities below .01 were deleted, and remained stable until after
deletion of items with Rasch probabilities below .25. At that point the
correlation began to drop.

The rest of the comparisons showed some increase in invariance as
items with low Rasch probabilities were deleted. In comparison IV
(between college students and civil service employees on 30 word
analogies), the increase in correlation was somewhat erratic. In com-
parison II (between college students and DVR clients on 25 word
analogies), the item easiness estimates were negatively correlated. But
this latter comparison and the comparisons of college and high school
students on 60 word analogies (comparison I), on 50 picture analogies
(comparison V), and on 40 symbol analogies (comparison X), all correlated .99
when items with low Rasch probabilities were removed.

The relationship was relatively simpler for the z item difficulty
estimates. In general, the less restrictive the range of acceptable item
difficulties, the higher the correlations. For each of the six com-
parisons in which the z item difficulty correlated .90 or higher (com-
parisons I, IV, V, VIII, IX, and X), the highest correlations were
observed when all items were included in the comparison, and the cor-
relation dropped with each restriction of the range of acceptable item
difficulties. The correlations fluctuated randomly with each restriction
of the range of acceptable item difficulties for the four remaining com- 
parisons (II, III, VI, and VII). 


Test Calibration 


In estimating the amount of ability indicated by raw scores on a test,
it is claimed that the Rasch model takes account only of the easiness of
the items in a test. It is appropriate, therefore, to ask whether the
ability estimates are invariant with respect to the ability of the
calibrating sample. For each of the ten comparisons investigated (see
Table 1), the product-moment correlation between the Rasch ability
estimates was .999. Figure 1 illustrates the relationship between the
ability estimates calculated for a 25-item word analogy test from the
responses of 630 college students and the responses of 90 DVR clients
(comparison II).

Figure 1. Invariance of Rasch Ability Estimate.


Discussion 


Item Calibration 


Ten tests of the invariance of Rasch item easiness estimates were 
made with mixed results. However, the results are not as equivocal as 
they may appear. Anderson et al. (1968) have pointed out that the 
Rasch model does not lend itself to small samples. Generally samples 
of 500 or larger are needed to obtain stable item easiness (and ability) 
estimates. It is important, therefore, to keep the size of the sample in
mind when interpreting the results. Comparisons I and X were based
on the responses of 630 college students and 300 high school students
and yielded correlations of .95 and .98. Correlations of .97 and .93
were obtained when the results obtained from 492 college students
were compared with those obtained from 120 and 145 high school stu-
dents (comparisons V and IX). And comparisons IV and VIII based 
on the responses of 276 college students and 269 civil service em- 
ployees yielded correlations of .91 and .88. In contrast, comparisons 
II and III involving item easiness estimates obtained from 89 DVR 
clients resulted in zero correlations. 

Two comparisons (VI and VII) remain, however, which did not sup- 
port the hypothesis under test. Both were based on small samples, but 
the samples were larger than some used in comparisons which did sup- 
port the hypothesis. It is possible that the nature of the test was a fac- 
tor in these results. Both comparisons involved the 25-item picture 
analogy test. It seems likely, therefore, that some factor other than 
ability and item difficulty may have influenced the probability of a cor- 
rect response. This factor might have been recognition of some of the 
picture analogies as identical to the preceding word analogies. 

Another factor which may have served to reduce the invariance of 
the item easiness ratios must be mentioned briefly. Panchapakesan 
(1969) provided a criterion for the elimination of examinees with low 
scores so that the estimation of item easiness will not be contaminated 
by guessing. According to her criterion some of the subjects in this 
study should have been eliminated. Because of the initially small sam- 
ple sizes, this procedure was not followed. It is possible, therefore, that 
guessing may have reduced the invariance of the item easiness ratios 
in some instances. 

In summary, six of the ten comparisons supported the hypothesis 
that the Rasch item easiness ratios were invariant with respect to the 
ability of the calibrating sample even though a number of the com- 
parisons involved samples of questionable size. Of the four remaining 
comparisons, two included samples so small as to invalidate the results 
while the other two may have been invalid because the Rasch model 
was not appropriate for tests designed in that manner. 

It must be noted that the results for the z item difficulty ratios fol- 
lowed those for the Rasch item easiness estimates. The data in the pre- 
sent study provide no basis for choosing between the two item 
parameters. Such a choice could be made, however, on the basis of the 
assumptions involved in the use of the two parameters. The z item 
difficulty estimate requires the assumption that the ability be normally 
distributed while the Rasch item easiness estimate requires no assump- 
tion about the ability of the calibrating sample. Therefore, either the 
samples used in this study were normally distributed in terms of 
ability, or z item difficulty estimates are robust to violations of the
assumption of normality.

The above results represent a stringent test of the Rasch model in
that items for which the Rasch model is clearly inappropriate were in-
cluded in the comparison. Deletion of these items should result in an 
increase in the correlation of the item easiness estimates obtained from 
different samples. This result was observed for five of the six valid com- 
parisons. In three of these comparisons (I, V, and X), the correlation 
increased to .99. In the other two cases (comparisons IV and IX), the 
correlation increased at first and then decreased. In both such in- 
stances the number of items remaining had become so small that the 
decrease in correlation may have resulted from a restriction of the 
range of item easiness estimates. Only comparison VIII (between civil 
service employees and college students on 30 picture analogies)
failed to support this hypothesis. Both samples completed these pic- 
ture items after completing 30 word analogies having identical 
relationships, thereby possibly contaminating their response to the 
picture analogies. 


Test Calibration 


The results of each of the ten comparisons supported the hypothesis 
that Rasch ability estimates are invariant with respect to the ability of 
the calibrating sample. Even in those instances in which the samples
were so small that the individual item easiness estimates were sample
dependent, the resulting ability estimates were invariant. These results 
indicate that the ability estimates assigned to any collection of 25 or 
more items will be invariant with respect to the ability of the 
calibrating sample regardless of whether the separate item easiness es- 
timates are invariant. 


Conclusions 


The results of this research support the following conclusions: 

1. Rasch item easiness ratios are invariant with respect to the 
ability of the calibrating sample when a sample of adequate 
size is used. 

2. Invariance of the Rasch item easiness ratios is related to the 
goodness-of-fit of the items to the Rasch model. The deletion 
of items with low Rasch probabilities increases the invariance 
of the Rasch item easiness ratios. 

3. The estimation of the amount of ability indicated by the raw 
scores on a test is invariant with respect to the ability of the 
calibrating sample for tests of 25 or more items, even when 
relatively small samples are studied.


THE MEASUREMENT OF THE EFFECT OF
SUGGESTION ON PERCEPTION

A scale based on experimental methods has been prepared for
measuring the effects of indirect suggestion upon perception. Three 
categories are included: (1) distorting the interpretation of presented 
stimuli, (2) inducing sense-impressions in the absence of adequate 
stimuli, and (3) producing insensitivity to stimuli that are objectively 
present. Test situations were designed for tactual, auditory, and 
visual perception. The scale was tested on a sample of 112 students 
from the 11th and 12th grades of a large city high school (58 girls and
54 boys).

Most of the item intercorrelations were positive and many signifi- 
cantly so. Eliminating the 9 lowest items of 21 left 12 for a reduced 
matrix, with the first factor accounting for 23% of the total variance.
There were no special factors attributable to sensory modality. By
summing the scores on the 12 items, a scale was produced with a
reliability of .82.

Difficulties and limitations are discussed, along with the potential 
applications of such a scale in the study of socially important 
behaviors. 


In his review of suggestibility in the normal waking state, Evans
(1967) found that the reported experiments had explained very little of
the basic phenomena. We agree with his assessment of the situation 
(see also the related discussion by Gheorghiu, 1972, 1973). For some 
time there has been considerable overlap in the investigations of sug- 
gestibility phenomena, particularly in sensory and motor tests (e.g., 
body-sway, arm and hand levitation, heat-illusion, progressive lines). 
Although these tests have been taken as a basis for understanding sug- 
gestion, they have in fact served merely to define the range of the 
behavior to be investigated. The paucity of firm conclusions is 
evidenced also in the factor analytic studies. As the investigations of 
Hammer and others (1963) and of Evans (1967) have shown, varia- 
tions in the tests produce different factors of suggestibility. 

The correlational and factor-analytic studies lead to the questions, 
first, whether the results show any clear unity or coherence within 
suggestibility phenomena, and, second, whether the measures meet ap- 
propriate experimental standards relevant to the behavior under in- 
vestigation. Heavy weight has been given in recent years to the in- 
terpretation of statistically detected factors, and little attention has 
been paid to developing and adapting behavioral measures based on 
underlying hypotheses about suggestibility and its manifestations. 

To the best of our knowledge, despite all the investigations that have 
been made both clinically and experimentally, there is not a single 
standardized test battery to be found. The clarification that can be ex- 
pected from a careful application of experimental methods is lacking. 

Looking back on the findings of different authors, gross differences 
are found, leading Duke (1964) to say that it is essential "that all 
details of the test situation be carefully examined, specified, and 
appraised." This is without doubt necessary, but it seems equally im- 
portant to be clear about the definition of what is measured by the sug- 
gestibility tests, and to specify the purposes intended by their in- 
vestigation. Our primary purpose has been to investigate interpersonal 
differences in the effects of suggestion on perception. With this 
background it will be possible in later investigations to relate sugges- 
tion to other influences upon perception, and to consider the findings 
in the framework of differential psychology. 

Although a complete sensory suggestibility test battery would have 
some variants in the kinds of suggestions used (direct, indirect, co-
judge, etc.), we have chosen to limit our first scale to perception using
the single variant of indirect suggestion. There are three main
categories of the influence of suggestion upon perceptions: (1) dis-
torting the interpretation of presented stimuli; (2) inducing sense
impressions in the absence of adequate stimuli; and (3) producing in-
sensitivity to stimuli that are objectively present. Many of the current
investigations of suggestibility have not carefully distinguished between
these functions, and have not studied the functions systematically 
in reference to various sensory modalities. 

With the proposed scale, in which the functions studied are carefully 
specified and experimentally measured, further experiments can be 
performed to investigate many aspects of influences on cognitive 
processes such as imagination, expectation-deception, agreement tend- 
encies, etc. Preliminary experiments, previously reported (Gheorghiu, 
Hodapp, and Thiedig, 1972) suffice to show that it is possible 
to prepare such a scale. 


Experimental Design 


In order to carry out the plan for developing a scale that would test 
the effect of suggestion on perception, experimental test situations 
were designed for tactual, auditory, and visual perception, producing 
as far as possible analogous demands upon the subject, and having in 
mind the three categories previously mentioned. In all, seven tests were 
designed for each modality. The 21 tests are described in detail in 
Table 1. 

To make some of the sensory distortions more plausible, all tests 
were conducted with sensory input attenuated or impeded. We in- 
troduced an appropriate “impediment” for each sense: plaster on the 
skin for the area to be tactually stimulated, a foam-rubber insert in the 
headphones, and dark glasses before the eyes. The main purpose in us- 
ing the impediments was to increase the subject's concentration of at- 
tention, and to make the suggested sensory experiences more credible. 


Subjects 


Pupils from the 11th and 12th grade of a large city high school
served as subjects in the investigation. With few exceptions, entire clas-
ses were tested during school hours, although each test was given in-
dividually. The total sample of 112 consisted of 58 girls and 54 boys.
Ages ranged between 15 and 18 years. The majors were as follows:
science, 38%; language arts, 47%; and social studies, 15%.

would be to concentrate with utmost attention on what he perceived,
if, and with what degree of certainty, he experienced the sensation. 


Procedures 


The functioning of each of the devices was demonstrated before 
every test, without, however, stimulating the subject directly. Each 
single task began with the signal "Attention" and lasted 15 seconds. 
The subject was instructed to say “Now” as soon as he perceived the 
stimulus. The reaction time between presentation and his saying
"Now" was recorded. The subject was then asked to state whether his 
perception was "certain" or "uncertain." The items were always 
presented in the same order within a sensory modality, but the 
modalities were presented in random order. Each item was presented 
twice, once to one hand, ear, or eye, next to the other hand, ear, or eye. 
Total testing time was between 40 and 50 minutes. 

There were two male and two female experimenters (psychology 
students) who worked independently. Each experimenter tested an
equal number of male and female pupils, the same experimenter con-
ducting all tests with the same pupil. 


Results 


Item Reliability on Retesting 


As indicated, we carried out two trials for each item. The computed 
Phi-coefficients based on the simple dichotomy for each item of “reac- 
tion/no reaction” lie between 0.25 and 0.57 with two exceptions only 
(Tactile Item 4 and Tactile Item 5). All the coefficients are positive, 
and can be considered as reliability coefficients. When only one of the 
two items was passed there appeared to be no tendency for the first or
the second to be more frequently passed, as determined by the χ² (chi-
square) test of McNemar.
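The two statistics used here can be sketched for a single item. The cell counts are hypothetical and serve only to show the arithmetic; the McNemar statistic is written with the usual continuity correction, which is an assumption, since the text does not say which variant was applied.

```python
import math


def phi_coefficient(both, first_only, second_only, neither):
    """Phi for a 2 x 2 table of reaction / no reaction on trial 1 by trial 2."""
    numerator = both * neither - first_only * second_only
    denominator = math.sqrt(
        (both + first_only) * (second_only + neither)
        * (both + second_only) * (first_only + neither)
    )
    return numerator / denominator


def mcnemar_chi_square(first_only, second_only):
    """McNemar chi-square (1 df, continuity corrected) on the discordant cells."""
    return (abs(first_only - second_only) - 1) ** 2 / (first_only + second_only)


# Hypothetical counts for one item across 112 subjects.
both, first_only, second_only, neither = 40, 15, 12, 45

phi = phi_coefficient(both, first_only, second_only, neither)
chi2 = mcnemar_chi_square(first_only, second_only)

# 3.841 is the .05 critical value of chi-square with 1 degree of freedom.
order_effect = chi2 > 3.841
```

Phi measures agreement between the two presentations of an item, while the McNemar statistic uses only the subjects who reacted on exactly one trial to ask whether the first or the second presentation was passed more often.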


Item Intercorrelations and Factor Analysis 


If there is some common or unified aspect of suggestibility running 
through the tests, it is to be expected that the scores will be positively 
correlated. For each item the score could be 0, 1, or 2, depending on
whether the response failed to occur or occurred on one or both trials. 
When all of the intercorrelations were examined, it was found that
most of them were positive, many of them significant, and of the few
that were negative, none was significant. In order to emphasize the
common factor, 9 of the 21 items yielding the lowest correlations with
the total test were eliminated in the further analyses. The reduced
matrix, based on the remaining 12 items, is given in Table 2. Most of
the correlations are significant, and none is negative. It may be infer-
red from such a correlation matrix that there will be a common factor
present. It was found that the latent root of the first principal compo-
nent came to 4.74, representing 22.6% of the total variance, and further
roots were negligible (Figure 1).

* The help of E. Feingold, K. Krein, G. Ries, and Chr. Wortmann, who worked with
us in conducting the experiments, is gratefully acknowledged.

Multi-factorial solutions, rotated according to Kaiser's varimax 
criterion, failed to yield any factors unique to either the special sense 
modalities or the categories of suggestion. There are some technical
difficulties in interpreting this finding, to which reference will be made
later.


Item Analysis 


If all items are summed, as though there is a 21-item scale, the total 
scale yields a reliability coefficient of .80, according to the method
of Hoyt and Stunkard (1956), equivalent to the Kuder-Richardson
alpha for weighted items. Because a few of the items had correlations 
with the total test (part-whole corrected) well below .30, they were 
eliminated as noted previously, and 12 items retained. This 12-item 
scale had a reliability of alpha = 0.82, according to Hoyt-Stunkard. 

Table 3 shows the results of the item analysis for the selected items. 
Of the items retained, three (Items 1, 5, and 6) were common to all 
three modalities, one was common in addition to audition and vision 
(Item 4), and another (Item 3) was found for vision only. Items 2 and 7
were unsuccessful by this criterion for all sense modalities.

Although various weighting methods were tried out, the preferred
method proved to be to use the sum scores as previously described, 0,
1, or 2 for each item. The distribution of the sum scores based on the
12 items was skewed to the right with a mean of 8.78 out of a possible
24. The standard deviation was 5.31, and the standard error of
measurement 2.25. Hence there is a 95% probability that the “true test
value” will fall within the limits of X ± 4.41.
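The reported standard error and confidence limits are consistent with the classical formula in which the standard error of measurement equals the standard deviation times the square root of one minus the reliability. The sketch below assumes that this formula, and a normal multiplier of 1.96, were used; the text does not state the computation explicitly.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical test theory: SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)


sd = 5.31            # standard deviation of the 12-item sum scores
reliability = 0.82   # reliability reported for the 12-item scale

sem = standard_error_of_measurement(sd, reliability)

# Half-width of the 95% band, using the SEM rounded to two decimals
# as it is reported in the text.
half_width = 1.96 * round(sem, 2)
```

Rounded to two decimals, these values reproduce the standard error of 2.25 and the limits of 4.41 given above.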


Subjective Certainty and Reaction Time 


When subjects were classified as more highly suggestible (with sum 
scores of 8 to 24) or less highly suggestible (with sum scores of 0 to 7), 
"certain"-reactions were significantly more frequent among the highly 
suggestible, but they predominated for both groups. 

Figure 1. Latent roots of the principal components extracted from the matrix of cor- 
relations of the 12 items (Table 2). 

¹ For the statistical methods employed, see Lienert (1969). The item analyses were 
computed by means of Program IT 09, H. Vorkauf, Deutsches Rechenzentrum Darmstadt. 

In general, those subjects who reacted "uncertain" took relatively 
more time to respond than those who reacted "certain," but the 
relationship between "certainty" and speed of reaction was not strong, 
and did not show up as a connection between the sum scores and reac- 
tion time. 


Experimenter Effect and Sex of Experimenter and Subject 


The four experimenters produced somewhat different results. The 
raw scores for each subject were transformed by log (1 + X) to 
eliminate the dependence between mean values and the variances of 
the groups of subjects (Table 4). A simple analysis of variance of these 
transformed scores, classified by experimenter, resulted in an F-value of 
2.84 (f₁ = 3, f₂ = 108, p < .05). Although the female experimenters (3 
and 4) produced the higher mean scores, the difference of scores by sex 
of experimenter was not significant (according to the Scheffé method). 

TABLE 3 
Characteristic Values of Item Analyses of the 12 Selected Items 

Item          Mean value   Standard deviation   Item-test correlation 
1 Tactile        1.49            0.72                  0.28 
5T               0.35            0.56                  0.31 
6T               0.84            0.84                  0.48 
1 Auditory       0.98            0.79                  0.50 
4A               0.74            0.82                  0.48 
5A               0.62            0.75                  0.40 
6A               0.62            0.79                  0.58 
1 Visual         1.08            0.85                  0.44 
3V               0.54            0.76                  0.61 
4V               0.36            0.66                  0.49 
5V               0.70            0.85                  0.51 
6V               0.46            0.70                  0.61 

The point-biserial correlation between sex of the subject and the raw 
sum score was .14, indicating essentially that the scale is independent 
of the sex of the subject. The interaction between the sex of the subject 
and the sex of the experimenter also proved nonsignificant. 
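
As an illustration of the one-way analysis of variance by experimenter, an F ratio can be computed directly from the means and variances in Table 4. This is only a sketch on the untransformed sum scores, so it is not expected to reproduce the reported F of 2.84, which was obtained after the logarithmic transformation. It assumes that the tabled variances are unbiased estimates (divisor n - 1); the function name is illustrative.

```python
import math

# Means and variances of the raw sum scores per experimenter (Table 4).
means = [8.07, 6.86, 8.64, 11.54]
variances = [26.37, 13.76, 21.79, 41.74]
n_per_group = 28


def f_from_summary(means, variances, n):
    """One-way ANOVA F ratio for equal group sizes from summary statistics."""
    k = len(means)
    grand = sum(means) / k
    ss_between = n * sum((m - grand) ** 2 for m in means)
    ss_within = (n - 1) * sum(variances)   # assumes divisor n - 1 was used
    df_between, df_within = k - 1, k * (n - 1)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    # Standard deviation of all scores pooled, as a consistency check.
    total_sd = math.sqrt((ss_between + ss_within) / (k * n - 1))
    return f_ratio, df_between, df_within, total_sd


f_ratio, df1, df2, total_sd = f_from_summary(means, variances, n_per_group)
print(round(f_ratio, 2), df1, df2, round(total_sd, 2))
```

The degrees of freedom come out as 3 and 108, matching those reported, and the pooled standard deviation of about 5.31 agrees with the value given in the item analysis. The raw-score F is about 4.26.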


Discussion 


The main result of this investigation was that a consistent scale 
could be constructed to measure suggestible behavior in the area of 
perception, using an indirect approach. 

TABLE 4 
Mean Values and Variances of the Sum Scores, Separated According to Experimenters 
(N = 28 for Each Experimenter) 

Experimenter   Sex      Mean value   Variance 
1              male        8.07       26.37 
2              male        6.86       13.76 
3              female      8.64       21.79 
4              female     11.54       41.74 

Although the factor analysis yielded a first factor accounting for 
23% of the variance, and multiple-factor methods led to no meaningful 
subordinate factors, this indication of a unity or coherence underlying 
the measures must not be accepted uncritically because of the manner 
in which test items enter into a determination of factors. It appears 
quite clear that there are no special factors attributable to sensory 
modality, because the three modalities were amply represented in the 
tests. However, the three types of influence upon perception were not 
equally represented, and it appears that the common factor can be 
considered more representative of perceptual distortion than of either 
producing a perception in the absence of a stimulus, or of negating a 
stimulus objectively present. Items 1, 5, and 6, that appear in the final 
scale for all sensory modalities, represent, first, an increasing stimulus, 
second, the continuation of a series following objective stimulation, 
and third, the supplementation of a single stimulus by its bilateral 
representation. These appear to fit the distortion paradigm better than 
the pattern of creating a perception in the total absence of sensory 
stimulation. Because more items are of this kind than of pure represen- 
tations of creating a perception or of completely annulling one, it may 
be that the appearance of special factors is in part owing to the relative 
frequency of acceptable items in the three categories. The one item 
clearly reflecting annulment of a stimulus (Item 7) does not appear in 
the final scale at all because it correlated too low with the total test. If, 
however, there had been several such low-correlating items that correlated with 
each other, they would then have determined another factor. A clarifi- 
cation of this problem is a task for the future. 

In the item analysis the visual items turned out to be more represen- 
tative of the scale as a whole than the items of the other senses, with 
five visual items appearing in the 12-item scale as against four auditory 
and three tactile items. The advantage for the visual items may owe to 
the better control of conditions of testing, including darkening of the 
experimental room as a possible support for the influence of sugges- 
tion. 

The influence of the experimenter upon the results is somewhat dis- 
turbing. With a small sample of experimenters the result could well 
have been due to slight differences in the manner in which the tests 
were presented to the subjects. Therefore it seems necessary, especially 
in suggestibility research, to unify and control the experimental perfor- 
mances even more carefully than we did. 

The experiment raises a number of questions that can be answered 
only by further investigation. For example, it is not known whether or 
not the "certain" and "uncertain" responses are influenced by the per- 
sonality of the subject, and the findings on reaction time suggest that 
the optimal duration of the experimental item should be studied. 

Because these experiments have dealt with only the indirect variant 
of suggestion, no definite statements can be made about the compo- 
nents that cause the given response. It is an open question whether we 
are dealing primarily with the ability to imagine or, instead, with a 
tendency to agree or to comply. The possibility of social compliance is 
always present when one relies, as in these experiments, on the verbal 
reaction as the indicator of suggestible behavior. It will therefore be 
necessary to carry out other experiments with other variants to deter- 
mine which components are operative. 

The present scale, supplemented by other variants, should 
ultimately provide a measure of individual differences in known 
behavior that can then be used in studying other socially important 
behaviors, whether imagination, tendencies to agree or disagree, or 
responsiveness to expectations of various kinds. 


Summary 


1. The experiments yielded a scale that measured the influence of 
suggestibility upon perception consistently in the areas of tactile, 
auditory, and visual perception. 

2. By weighting the responses according to whether one or both of 
the tests was passed with repeated items, a 12-item scale with an inter- 
nal consistency of 0.82 resulted. 

3. The degree of reaction was also differentiated by the subjects' 
report of "certain/uncertain" following each perceptual report. 

4. The data indicated a significant experimenter effect. Although 
the two female experimenters produced numerically higher scores than 
the two male experimenters, the difference was not significant. We also 
found no significant differences based on the sex of the subjects. 


REFERENCES 


Duke, J. D. Intercorrelational status of suggestibility tests and hyp- 
notizability. Psychological Record, 1964, 14, 71-80. 

Evans, F. J. Suggestibility in normal waking state. Psychological Bul- 
letin, 1967, 67, No. 2. 

Gheorghiu, V. A. Betrachtungen über Suggestion und Suggestibilität. 
(On suggestion and suggestibility). Scientia, 1972, 107, 811-860. 

Gheorghiu, V. A. Untersuchungen zur sensorischen und motorischen 
Suggestibilität. Unveröffentlichte Habilitationsschrift, Mainz. 

Gheorghiu, V. A., Hodapp, V., and Thiedig, S. Untersuchungen zur 
taktilen, auditiven und visuellen Suggestibilität. Archiv für 
Psychologie, 1972, 124, 303-320. 

Hammer, A. G., Evans, F. J., and Bartlett, M. Factors in hypnosis and 
suggestion. Journal of Abnormal and Social Psychology, 1963, 67. 

Hoyt, C. I. and Stunkard, C. L. Estimation of test reliability for un- 
restricted item scoring methods. EDUCATIONAL AND 
PSYCHOLOGICAL MEASUREMENT, 1956, 12, 756-758. 

Lienert, G. A. Testaufbau und Testanalyse. Weinheim, 1969. 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 353-359. 


A STUDY OF THE EFFECT OF THE VIOLATION 
OF THE ASSUMPTION OF INDEPENDENT 
SAMPLING UPON THE TYPE I ERROR RATE 
OF THE TWO-GROUP t-TEST 


ROBERT W. LISSITZ AND STEVE CHARDOS 
University of Georgia 


This paper describes some of the situations in which a psychologist 
is likely to violate the assumption of independent errors. A Monte- 
Carlo study of the effects of this violation is then described. A 
number of examples of different kinds and degrees of dependency are 
included along with a table that gives the effect of the dependency 
upon the shape of the test statistic’s distribution. The results of this 
study demonstrate that this assumption is a critical one. Researchers 
are strongly urged to avoid hypothesis testing if they suspect that the 
assumption of independence has been violated. 


THIS paper is an attempt to examine the effect upon the type I error 
rate of the violation of the assumption of independent random sam- 
pling within the specific context of the two group t-test. It is our feeling 
that the results generalize far beyond this situation, but for con- 
creteness we will consider this case in some detail. This is an assump- 
tion that has been largely ignored in the statistical literature concerned 
with robustness. Boneau (1960) has written what is probably the most 
popular article for psychologists on the subject of assumptions of the 
t-test. His article does not treat the independence problem at all. There 
are probably some reasons for the lack of interest in the subject of this 
assumption but it is our feeling that this neglect is most unfortunate. 

Copyright © 1975 by Frederic Kuder 

One reason that we feel that the assumption needs study is that it is a 
common problem in psychological applications. For example, con- 
sider a study in which the experimenter is interested in the behavior of 
subjects who are participating in therapy groups. He might run three 
groups under one condition and three groups under another condition 
and then treat the design to a t-test. Instead of there being three ex- 
perimental units (one for each group) he uses the individual scores 
from each person in the group. These individual scores are, of course, 
dependent upon one another. Another example of a situation in which 
dependency can arise is one in which the subjects are volunteers (such 
as an introductory psychology class). If the subject likes the experi- 
ment the subject will tell his friends and they will come to the ex- 
periment with a certain set that is obtained from the first subject. This 
introduces a certain level of dependency among subjects. Another 
example is one in which the dependent variable results from a rating 
procedure in which a single rater looks at more than one of the sub- 
jects, again introducing a dependency. 

Another reason for the importance of studying this assumption is 
that it appears in most statistical tests. A common criticism of parametric 
statistics is that they make assumptions that are untenable and then 
the critic suggests using nonparametric or distribution free statistics. 
Unfortunately these alternatives also make the assumption of in- 
dependence across subjects. 

The effect of making a variety of assumptions has been the subject 
of two excellent, recent reviews. One of these reviews is by Glass, 
Peckham, and Sanders (1972) and the other by Huber (1972). As noted 
by Glass, et al. (1972), there have been two papers and a book that are 
concerned with the problem of dependency. G.E.P. Box (1954) has dis- 
cussed some of the problems related to analysis of variance and the 
effect of violation of assumptions. Sometimes the violation of assump- 
tions cannot be avoided by more careful experimental design. As he 
states (page 484) "Data occur, however, in circumstances where there 
is no possibility of using this device (randomization), usually because 
the factor which is to be studied is the effect of time or position, which 
itself gives rise to the correlation." He considers the situation in which 
there is a serial correlation between errors and provides a table which 
indicates that the true type I error will far exceed the nominal value 
provided by the experimenter when there is even a relatively small 
positive correlation. Cochran (1947) indicates that a constant correla- 
tion between every pair of scores can have a large effect upon the true 
variance of the treatment mean and, therefore, an effect upon the 
resulting test statistic. He does not pursue this problem in any great 
detail and does not present data in terms of the type I error rate. 

The book by Scheffé (1959) has a short section on the effect of viola- 
tion of the assumption of independence of errors. Scheffé also deals 
with the case of serial correlation and briefly describes the effect upon 
confidence intervals of the mean under large sample conditions. He 
presents a table of the true alpha error rate that indicates that with 
positive serial correlation the alpha rate goes up and with negative 
serial correlation it goes down. 

These few references are all that are mentioned in the two recent 
reviews referred to above. There is additional work but it is pre- 
liminary and/or impossible to apply to most psychology problems, 
at the present time. 

The following sections of this paper describe a Monte-Carlo study 
in which the effect of nonindependence is analyzed. It is hoped that 
the following series of realistic examples will illustrate the need for 
more careful attention to this assumption. 


Procedure 


A FORTRAN IV (double precision for the CDC 6400) computer 
program¹ was written that would generate successive samples of data 
that are used to calculate two sample t-tests. A multivariate normal 
generator² was used that was extremely accurate even for a very large 
number of dimensions. The use of this generator allowed for the 
specification (and hence variation) of the covariance structure of the 
subjects within a group while maintaining the normality and equal 
variance assumptions as well as the null hypothesis of no difference in 
the means.³ 

A mean vector of zeros (the number of elements equal to the 
number of subjects in a group) was input along with the variance- 
covariance structure desired. The variance-covariance matrix always 
consisted of ones in the diagonal, thus making the off-diagonal values 
equivalent to correlations between subjects within a group, across 
replications. 


¹ A copy of the complete program is available upon request from the first author at 
the Psychology Department, University of Georgia, Athens, Georgia, 30602. 

² We would like to thank Dr. Rolf Bargmann of the Statistics Department, University 
of Georgia, for the generous loan of his excellent multivariate normal generator and for 
his help in operationalizing the subroutine. 

³ This means that the null hypothesis of equal means was true even though the 
covariance structure has been altered. In other words, the effect of the dependency 
among subjects is not systematic by group in such a way that one group will have a 
higher mean than the other. This is an important type of dependency, but the effect upon 
the probability of rejecting the null hypothesis is clear: it will increase this probability. 
In contrast, the effect of the types of dependency we are considering is not at all clear. 
Because of the expense of computer time only one sample size was 
chosen for this illustration. The size selected was 31 subjects in each 
group giving equal sample sizes for the two groups and 60 degrees of 
freedom. This, of course, means that the multivariate normal 
generator had a mean vector with 31 elements (each zero) and a 31 × 
31 variance-covariance matrix. It was our feeling that this represented 
a "typical" application of the two group t-test. 


Results 


The first set of t-values to be run consisted of the case in which the 
variance-covariance structure was an identity matrix. This was tabled 
as a check on the computer program. As can be seen in Table 1 the 
column titled "independent" resulted in a distribution of t-values that 
are very close to the expected set of t-values. The differences that did 
result are extremely small.⁴ 

A variety of types of dependency were investigated and are also a 
part of Table 1. The first are the cases in which every subject is cor- 
related with every other subject either .2 (col. I), or .4 (col. II), or −.2 
(col. III), or −.4 (col. IV). The variance-covariance matrix, in these 
cases, is a constant matrix except for the diagonals. These last two 
cases are nongramian and, therefore, impossible in practice, although 
the computer program was written to handle these cases. They are in- 
cluded here as idealizations of reality and because the results were sen- 
sible and very interesting. As can be seen, the effect of non- 
independence is extremely large even in the .2 case. The effect of 
positive dependence is to increase the size of the tails and the effect of 
negative dependency is to decrease the size of tails of the empirically 
derived distribution. 
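
The size of this effect follows from a standard result for equally correlated scores. As a sketch, assume unit variances, n subjects per group, and a common within-group correlation ρ (these symbols are introduced here for illustration and are not the paper's notation). The variance of a group mean, the expected within-group sample variance, and the ratio of the true variance of the mean to the value the t-test implicitly assumes are then:

```latex
\operatorname{Var}(\bar{X}) = \frac{1 + (n-1)\rho}{n}, \qquad
E\!\left(s^{2}\right) = 1 - \rho, \qquad
\frac{\operatorname{Var}(\bar{X})}{E\!\left(s^{2}\right)/n}
  = \frac{1 + (n-1)\rho}{1 - \rho}.
```

With n = 31 and ρ = .2 the ratio is 7/0.8 = 8.75, so the denominator of the t statistic understates the standard error of the mean difference by a factor of roughly three. The same expression shows why the negative cases are nongramian: Var(X̄) must be non-negative, which requires ρ ≥ −1/(n − 1) = −1/30, and neither −.2 nor −.4 satisfies this.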

Another type of dependency that might be expected in psychological 
experiments is that of the serially correlated subjects. This set of 
variance-covariance matrices consists of zeros everywhere but the 
main diagonal, and the diagonal on either side. That is, adjacent sub- 
jects are correlated with each other and independent of all other sub- 
jects. Again, dependencies of .2 (col. V), .4 (col. VI), −.2 (col. VII), and 
−.4 (col. VIII) were run and tabled. This set of results agrees with 
those found by Box (1954) and Scheffé (1959) and the general conclu- 
sions are the same as for the earlier data except that the magnitude of 
the effect is less. 


⁴ The chi-square test of goodness of fit was barely significant at the .05 level, but (in 
our opinion), for the purposes of this paper, the results are quite clear and not distorted 
by the very minor divergence detected by the statistical test. It should be noted that the 
power of this test is very great since it involves 1,000 t-values. 
A third type of dependency is a generalization of the serial type in 
which each subject is correlated with his two adjacent subjects. The 
covariance matrix has, in addition to the ones in the diagonal, con- 
stants of .2 (col. IX) in the adjacent two diagonals or .4 (col. X) in 
these adjacent two diagonals. Every other value is zero. The effect of 
this dependency is to increase the type one error rate. The amount of 
increase for the .2 case is roughly equivalent to the .4 serial case, and 
the .4 case is much larger. 

A fourth type of dependency is that in which every subject is as- 
sumed to communicate with five other subjects and the amount of 
communication is further assumed to diminish across the five subjects. 
The covariance matrix was designed to reflect this. The values adjacent 
to the main diagonal are, from the left: .1, .2, .3, .4, .5, then the 1 in 
the diagonal and to the right, are: .5, .4, .3, .2, .1. This case, as can be 
seen in Table 1 (col. XI), gives rise to an increase in the true type I er- 
ror rate that is comparable to the constant .4 dependency matrix. 
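
The dependency structures described above can be sketched in a few lines. The following is not the authors' FORTRAN program but a minimal standard-library illustration of the same design: 31 correlated normal scores per group, both groups with zero means, 1,000 replications, and a two-tailed pooled-variance t-test at the nominal .05 level. All function names, the seed, and the Cholesky-based sampler are assumptions made for this example.

```python
import math
import random


def constant_cov(n, rho):
    """Ones on the diagonal, rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


def banded_cov(n, off_diagonals):
    """Ones on the diagonal; off_diagonals[k-1] on the k-th diagonals."""
    cov = [[0.0] * n for _ in range(n)]
    for i in range(n):
        cov[i][i] = 1.0
        for k, rho in enumerate(off_diagonals, start=1):
            if i + k < n:
                cov[i][i + k] = cov[i + k][i] = rho
    return cov


def cholesky(a):
    """Lower-triangular factor; raises ValueError if a is not positive definite."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def is_gramian(cov):
    """True if the matrix can serve as a covariance matrix for sampling here."""
    try:
        cholesky(cov)
        return True
    except ValueError:
        return False


def draw_group(low, rng):
    """One group of correlated N(0, 1) scores: low times a standard normal vector."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(low))]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(len(low))]


def pooled_t(x, y):
    """Two-group t statistic with pooled variance."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    ss = sum((v - mx) ** 2 for v in x) + sum((v - my) ** 2 for v in y)
    pooled_var = ss / (nx + ny - 2)
    return (mx - my) / math.sqrt(pooled_var * (1.0 / nx + 1.0 / ny))


def type_one_rate(cov, reps=1000, crit=2.0003, seed=0):
    """Share of null replications with |t| above the .05 critical value (60 df)."""
    rng = random.Random(seed)
    low = cholesky(cov)
    rejections = sum(
        abs(pooled_t(draw_group(low, rng), draw_group(low, rng))) > crit
        for _ in range(reps)
    )
    return rejections / reps


n = 31
independent = type_one_rate(banded_cov(n, []), seed=1)
constant_20 = type_one_rate(constant_cov(n, 0.2), seed=1)
serial_40 = type_one_rate(banded_cov(n, [0.4]), seed=1)
five_neighbours = type_one_rate(banded_cov(n, [0.5, 0.4, 0.3, 0.2, 0.1]), seed=1)
print(independent, constant_20, serial_40, five_neighbours)
```

The negative constant-correlation cases cannot be sampled this way, because `is_gramian(constant_cov(31, -0.2))` is false; the paper's generator evidently handled them by other means, which this sketch does not attempt to reproduce.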


Discussion 


The results section indicates quite clearly that the t-test is not robust 
to the assumption of independence. Even when as small as 4% of the 
total variance is shared between subjects (i.e., 96% independent) the 
effect upon the true alpha level is considerable. The appropriate con- 
clusion for the user of this, and probably any other test, is to ignore the 
significance level if he has any reason to believe that there is a lack of 
independence. 

Considerable literature on the general linear hypothesis involving a 
general variance-covariance matrix exists. Some of this work is 
presented in the book by Johnston (1972). A particularly important 
point for the user is that the linear model in which the disturbance 
terms are incorrectly assumed to have zero covariance does not lead to 
bias in the estimation of parameters of the model. Instead, the prob- 
lems arise with the variance of the estimator. It is no longer minimum 
variance. The minimum variance unbiased estimator requires 
knowledge of the population covariance structure. The implication of 
this material seems to be that if the researcher is interested in just 
estimating parameters, he will be in less trouble than if he tries to use 
classical hypothesis testing. 
THE RELATIVE VALIDITY OF SCALES PREPARED 
BY NAIVE ITEM WRITERS AND THOSE BASED 
ON EMPIRICAL METHODS OF PERSONALITY 
SCALE CONSTRUCTION¹ 


DOUGLAS N. JACKSON 
University of Western Ontario 


In an effort to evaluate alternative strategies of personality scale 
construction, the extent to which relatively naive item writers could 
produce valid personality scales was investigated. Each of 22 under- 
graduate psychology students was asked individually to prepare 16 
items for one of three scales: Social Participation, Tolerance, and 
Self-Esteem. These items were administered to a sample of 116 
females, comprising pairs of roommates, together with like-named 
scales from the empirically-derived California Psychological In- 
ventory (CPI) and from the Jackson Psychological Inventory (JPI). 
Self-ratings and roommate ratings on these trait dimensions which 
served as the two criterion measures were also obtained. Student 
scale validities, which were much higher than those obtained for the 
CPI, were almost comparable to those for the JPI. Student scale 
scores were less free from desirability variance than were JPI scale 
scores. Like the earlier Ashton-Goldberg study, the results were 
interpreted as supporting a construct-oriented over an external- 
empirical strategy of personality scale construction. 


¹ Portions of this paper were presented at the meetings of the Canadian Psychological 
Association, Windsor, Ontario, June 12, 1974. Supported in part from a research grant 
from the Canada Council. Grateful acknowledgement is made to Cheryll Kuhwald, 
dU Buckley, Margaret Rintoul, and Esther Wagner for their assistance. 

THE two major purposes of this study were (1) to appraise the extent 
to which relatively naive item writers can produce valid personality 
scales, and (2) to compare these scales with scales derived from other 
strategies, particularly those employing empirical methods using an 
external criterion. There has been a continuing concern (Hase and 
Goldberg, 1967; Goldberg, 1972) regarding the optimal method for 
constructing personality scales. Recently, the author (Jackson, 1971) 
issued a challenge to the effect that if the most elaborate empirical 
strategies of personality scale construction were pitted against the 
work of one or two item writers each of whom spent about two or 
three hours, validities would be higher for the scales constructed by the 
item writers. This point of view, by no means widely accepted, was is- 
sued as a challenge to researchers to conduct necessary comparative 
studies to determine whether and under what conditions the assertion 
might be accurate. 

In an important study Ashton and Goldberg (1973) undertook to 
evaluate systematically the extent to which use of naive and novice 
item writers might be superior to other strategies of personality scale 
construction. They compared a set of scales prepared by two groups 
without extensive experience with personality scale construction— 
laymen and psychology graduate students—to sets of scales de- 
veloped by the more elaborate empirical procedures drawn from the 
California Psychological Inventory (CPI). For the three scales in- 


vestigated, Sociability, Achievement, and Dominance, peer-rating 
validities were consistently higher for personality scales developed by 
psychology graduate students than for those drawn from the widely- 
used California Psychological Inventory. Validities for scales de- 
veloped by laymen, although significant in general, were considerably 
lower than those validities for scales prepared by the graduate stu- 
dents. Additionally, convergent validities of psychology students were 
found to be in a comparable range with the much more elaborately 
constructed Personality Research Form (Jackson, 1967), which em- 
ployed a combination of rational and empirical procedures concerned 
with obtaining optimum levels of internal consistency and freedom 
from bias. 

The present investigation, which was, in part, a replication and 
extension of the Ashton-Goldberg study, employed (a) a different set 
of personality constructs, the definitions of which are not quite so ob- 
vious to laymen as to professional item writers, (b) different methods 
for obtaining peer ratings, and (c) a somewhat different population of 
item writers. In addition to validity, personality scales were also 
evaluated for freedom from response style variance. It is hoped that by 
introducing procedural variations additional insight regarding the 
generalizability of the important Ashton-Goldberg findings will be- 
come possible. This procedure in turn should provide additional data 
bearing on the radically different approaches to scale construction 
proposed by Meehl (1945) and others and by Jackson (1971). 
Method 
Subjects 


A total of 22 undergraduate psychology students, largely in their 
third year of university, comprised the set of item writers employed in 
the study. All were enrolled in the author's course in Psychological 
Tests and Measurement. 

The respondents to the personality scales were 104 females drawn 
from a large women's residence at the University of Western Ontario, 
all of whom were paid volunteers. A prerequisite in the study was that 
each person have a roommate who was also willing to participate. 


Procedure 


The student item writers were first introduced to the question of 
different methods of scale construction and to the hypothesis that 
novice item writers could produce personality scales with worthwhile 
validities. The three targeted personality traits, Social Participation, 
Tolerance, and Self-Esteem, were next introduced. Random assign- 
ment of students to one of the three personality scales yielded a total of 
seven individuals for Self-Esteem, seven for Social Participation, and 
eight for Tolerance. The following definitions, representing an 
amalgamation of definitions reported for corresponding CPI scales 
(Gough, 1957) and the author's as yet unpublished Jackson Per- 
sonality Inventory, were presented to student item writers. (Since there 
is no Self-Esteem scale on the CPI, the Social Presence scale, with a 
substantially equivalent scale definition, was employed). 

Social Participation. Sociable, friendly, gregarious; will eagerly 
join a variety of social groups, seeks both formal and informal 
association with others. Values positive interpersonal relation- 
ships; outgoing, sociable, participative temperament. 

Tolerance. Broad-minded, undogmatic, open-minded; accepts 
people even though their beliefs and customs may differ from his 
own; open to new ideas; free from prejudice; permissive, accept- 
ing, and nonjudgmental in social beliefs and attitudes. 

Self-Esteem. Self-assured, confident, self-sufficient; poised in 
dealing with others; not easily embarrassed or influenced by others; 
imperturbable in interpersonal situations; poised, spontaneous, 
and self-confident in personal and social interaction. 

In addition to scale definitions, students were provided with a 
"minute lecture on the basics of item writing. Basic principles of sim- 
ple, direct good grammatical usage; conciseness in statements; free- 
dom from extreme levels of evaluation; the concept of content satura- 
tion; and use of medium levels of item popularity were emphasized in 
this presentation. Students were then instructed to prepare a set of 10 
true-keyed and 10 false-keyed items bearing on the personality dimen- 
sion to which they were assigned. It was suggested that they not spend 
more than two hours at this task. As an added inducement, students 
were advised that the item writer in each scale category whose scale 
obtained the highest validity with respect to peer ratings would re- 
ceive a prize of $10.00. Students were further instructed to identify 
after they had written all 20 items the two weakest true-keyed and two 
weakest false-keyed items in their set. These were excluded, but no 
further editorial discretion over item selection was exercised. 

The set of 22 16-item scales together with separate 16-item scales for 
Acquiescence and for Desirability was incorporated in a booklet en- 
titled “Personality Inventory," and was designated Student Per- 
sonality Inventory for reporting purposes. 

The Desirability scale was taken from Form A of the Personality 
Research Form (Jackson, 1967); the Acquiescence scale was com- 
prised of CPI items drawn randomly from the neutral range of de- 
sirability. Also included in the booklet were the three corresponding 
scales from the CPI. In order to conform to time limitations, and to 
make the CPI scales comparable in length to the students' scales, each 
of three sets of 16 items was chosen randomly from the CPI items com- 
prising each of the three scales. 

After the residents in the women's residence had completed the 
Jackson Personality Inventory and the Student Personality Inventory, 
they were instructed to complete a schedule containing peer ratings 
and self-ratings which served as the two criterion measures. The room- 
mate ratings used for the target traits were based on a 9-point rating 
scale, ranging from extremely characteristic to extremely uncharacter- 
istic of the degree to which the roommate possessed the named trait, 
Social Participation, Tolerance, or Self-Esteem, each with the identical 
definitions given item writers. 


Results 
Analyses of Predictor Relationships 


Table 1 presents the intercorrelations of the Self-Esteem, Social Par- 
ticipation, and Tolerance scales for the scales prepared by each of the 
psychology students, as well as the corresponding scales taken from 
the JPI and the CPI. Also included in Table 1 are the KR-20 reliabil- 
ity coefficients for the student scales. Looking first at the Self-Esteem 
scales, one notes that student scales were substantially correlated with 
TABLE 1 
Intercorrelations among the New Student Personality Inventory Scales Especially Constructed for this 
Study and JPI and CPI Scales 
(N = 116) 

Scale                              Psychology Students 
                               1     2     3     4     5     6     7 
Self-Esteem                1   - 
                           2  .76    - 
                           3  .55   .48    - 
                           4  .73   .66   .60    - 
                           5  .42   .41   .35   .47    - 
                           6  .53   .44   .42   .62   .46    - 
                           7  .63   .46   .59   .64   .48   .53    - 
JPI                           .65   .49   .67   .67   .38   .51   .14 
CPI                           .45   .46   .29   .48   .34   .23   .32 
KR-20 reliability estimate    .62   .72   .48   .74   .40   .50   .63 
Psychology Students 
: 8 9 10 п 12 13 4 
Social Participation 8 Ж 
9 42 с 
10 40 75 - 
11 19 23 26 - 
12 .30 49 40 16 - 
13 59 51 42 17 28 - 
4 .30 .68 65 38 53 43 - 
JPI 65 51 48 Е 24 
CPI 24 55 62 124 31 57 
KR-20 reliability estimate 48 60 68  —05 40 50 :62 
Psychology Students 
20:31 721 22 
осе E 15 16 17 18 19 
16 44 B 
17 24: 2080. - 
18 51 45 37 - 
19 34 36 30 47 - 
20 ВИ 54 43 - 
21 50 24 29 .38 20 2 n 
22 ; ; 23 ‚ЗОЖ Я 
JEI © "s 27 Е ОАО ШИ Зб 1:36 
с! 1305 ТА ОС NIA 
reliability estimate 65. 450016 E E ЗИ 34 


each other and with the two corresponding formal personality scales,
particularly the JPI. Indeed, the correlations were high enough to sug-
gest a substantial general factor, since in a number of cases the scale
intercorrelations actually exceeded the lower-bound KR-20 estimate
of reliability. The same observation holds for the Social Participation
and Tolerance scales except that the scale reliabilities and intercorre-
lations varied over a somewhat greater range of values for these scales.
In no case were any of the correlations between any of the scales pro-
duced by student item writers negative. The general tenor of these find-
ings indicates that scale definitions were communicated to student
item writers, and that these definitions and item writing instructions
were apparently sufficient to cause them to agree substantially on the
characteristics to be measured. Even though item analytic procedures
were not used, reliabilities were promising for 16-item scales. Fur-
thermore, a review of the entire scale intercorrelation matrix for the
Student Personality Inventory indicated that the different scales were
mutually independent and that they formed three distinct clusters of
Self-Esteem, Social Participation, and Tolerance.
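
As an illustration of the KR-20 reliability estimate referred to above, the following is a minimal sketch in Python. It assumes dichotomously scored (0/1) items arranged as one row per respondent; the function name `kr20` and the toy data are hypothetical and are not taken from the study.

```python
from statistics import pvariance

def kr20(responses):
    """KR-20 estimate for a list of respondents, each a list of 0/1 item scores."""
    n = len(responses)
    k = len(responses[0])
    # Proportion endorsing each item in the keyed direction.
    p = [sum(row[j] for row in responses) / n for j in range(k)]
    # Sum of item variances p * q, with q = 1 - p.
    sum_pq = sum(pj * (1 - pj) for pj in p)
    # Population variance of total scores, to match the p * q item variances.
    total_var = pvariance([sum(row) for row in responses])
    return (k / (k - 1)) * (1 - sum_pq / total_var)

# Toy data: three items that always agree, and two items that are unrelated.
consistent = [[1, 1, 1], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
unrelated = [[1, 0], [0, 1], [1, 1], [0, 0]]
```

With items that always agree the estimate reaches its upper limit, and with unrelated items it falls to zero, which is why values near .6 or .7 for 16-item scales written without item analysis can be read as promising.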


Analyses of Criterion Relationships


The two primary criteria used in the study were self-ratings and 
roommate ratings. Respective data for these two sets of criterion rela-
tionships are contained in Tables 2 and 3.

It may be noted that self-ratings were differentially sensitive to the 
average validities for the three scales, as they were highest for the Self- 
Esteem scale and lowest for the Tolerance scale. However, consider- 
ing the reliabilities of the Student Personality Inventory scales and of 
the self-ratings, which were based on a single judge, these are indeed 
high correlations. For all three traits every one of the 22 Student Per- 
sonality scales correlated significantly with self-ratings. It is note- 
worthy that the more reliable scales developed by the psychology 
students were superior to the less reliable ones in terms of their corre- 
lation both with self-ratings and with roommate ratings. Similarly, 
females showed a tendency to be superior to males in the validity of 
the scales that they produced, an outcome paralleling findings in the 


TABLE 2
Validity Coefficients as a Function of Scale-Construction Strategy Relative to the Self-Rating
Criterion Measure
(N = 116)
Targeted Traits


New Average Psychology Student 
SPI Most Reliable Psychology Student 
Scales Least Reliable Psychology Student 
Average Male Psychology Student 
Average Female Psychology Student 
Comparison Jackson Personality Inventory 3 47 У 3 
Scales California Psychology Inventory 43 3l 19 d 
TABLE 3
Validity Coefficients as a Function of Scale-Construction Strategy Relative to the Roommate Rating
Criterion Measure
(N = 116)
Targeted Traits 
SEs Soc.P Tol Average 
Average Psychology Student 27 29 48 (29 
Most Reliable Psychology Student 30 .34 20 .28 
Least Reliable Psychology Student 22 26 .16 2l 
Average Male Psychology Student 22 .28 16 22 
Average Female Psychology Student .29 33 20 cu 
Comparison Jackson Personality Inventory 30 E 23 29 
California Psychological Inventory 42 07 :08 .09 


social perception area indicating greater sensitivity of females to the 
implicit network of trait relationships (Lay, 1970). 

Considering the relationships with standardized personality tests, it
is noteworthy that for Self-Esteem and Social Participation the JPI did
show higher relationships on the average both for self-ratings and for
roommate ratings criteria than did the Student Personality Inventory.
For Tolerance, the JPI was superior in validity relative to roommate
ratings, but not self-ratings. The CPI did yield substantially lower
validities than did the student questionnaire for both self-ratings and
roommate ratings. Particularly disappointing were the relationships
between the CPI items and the targeted roommate ratings, which
averaged only .09. It should be recognized, of course, that the items in
this comparison did not comprise the entire set of items for each CPI
scale, but 16-item subscales chosen at random from the longer item
set. Similarly, it should be borne in mind that each of the JPI scales
was comprised of 20 items. Nevertheless, even if the longer CPI scales
as compared with the shorter ones produced somewhat higher valid-
ities, this outcome indicates at least that the CPI scales are less efficient
than are the student scales in predicting a criterion. If the JPI and CPI
scales are taken as representative of standardized personality in-
ventories, it is quite clear that the students in the study can generate
convergent validities in a comparable range to those of standardized
inventories. Considering the short length of the Student Personality
Inventory scales, as well as the fact that no internal consistency or
other item analytic procedures were applied, these validity coefficients
compare very favorably with those appearing in the literature for pub-
lished personality tests which have used these and other empirical
procedures.
Analyses of Response Styles


Many years ago (Cronbach, 1950; Jackson and Messick, 1958), it 
was recognized that response styles as distinguished from the logical 
validity of scales might contribute to their empirical validity while de- 
tracting from the measurement of the construct which the scale was 
designed to assess. In an effort to evaluate the possible role of re- 
sponse styles in the Student Personality Inventory, scales for ac- 
quiescence and desirability, both derived from a successive intervals 
scaling of the California Psychological Inventory, were included in the 
analysis. Since Student Personality Inventory scales proved to be
largely independent of scores from the Acquiescence scale, these cor- 
relations are not reported. Table 4 reports the correlations between the 
average scales developed by the psychology students and the De- 
sirability scale. Also presented are the corresponding correlations for 
the JPI and the CPI. As might reasonably be expected, correlations be- 
tween Desirability and Self-Esteem are moderately high, less so be- 
tween Desirability and Social Participation, and least between De- 
sirability and Tolerance. Of the scales reported, the only ones in which 
a systematic effort was made to suppress desirability variance in the 
total score were the scales derived from the JPI. In the construction of 
the JPI, the component of the total score correlated with Desirability 
was removed by partial regression procedures, and the residual 
component, uncorrelated with Desirability, used as the basis for item 
selection. A variant of the Differential Reliability Index (Jackson, 
1967; Neill and Jackson, in press) was employed. This had the effect of 
subtracting the squared biserial correlation of an item with the De- 
sirability scale from the squared item-total scale biserial, where the 
total scale comprised the component uncorrelated with Desirability. 
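
The subtraction just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: ordinary Pearson (point-biserial) correlations stand in for the biserial coefficients, and the names `pearson_r`, `differential_index`, and the toy scores are hypothetical rather than part of the JPI procedure.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def differential_index(item, content_total, desirability_total):
    """Squared item-content correlation minus squared item-desirability correlation."""
    r_content = pearson_r(item, content_total)
    r_desirability = pearson_r(item, desirability_total)
    return r_content ** 2 - r_desirability ** 2

# Toy data for six respondents: one 0/1 item, a content total, a desirability total.
item = [0, 1, 0, 1, 1, 0]
content = [1, 5, 2, 6, 5, 1]
desirability = [3, 3, 4, 4, 3, 4]
index = differential_index(item, content, desirability)
```

An item with a large positive index relates more strongly to its own content scale than to desirability; items with small or negative values would be the ones a constructor following this logic would discard.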


TABLE 4 
Correlations with Desirability as a Function of Scale-Construction Strategy 
(N = 116) 
Targeted Traits 
SEs Soc. P Tol — Average 
New Average Psychology Student 44 33 8! B 
SPI Most Reliable Psychology Student 45 .42 Al 3 6 
Scales Least Reliable Psychology Student 43 22 413 m 
Average Male Psychology Student 48 3l A 34 
Average Female Psychology Student 43 47 14 е 
Comparison California Psychological Inventory 31 59 62 A 
Scales Jackson Personality Inventory 36 25 24 ji 
Results from the present analysis would appear to provide support
for this procedure, at least for the Self-Esteem and Social Participa-
tion scales, for which the JPI scales have shown the least high correla-
tions with the external desirability scale. But, in the case of the SPI
scales, it should be recognized that correlations with Desirability were
not excessively high, especially when compared with those
of the corresponding CPI scales. It would, of course, be appropriate
to sample scales more widely, especially from the set of those with ex-
treme levels of desirability. But it nevertheless is of some considerable
importance to learn from the present study that novice item writers
can develop scales relatively free of response style effects.


Discussion 


When Ashton and Goldberg published their findings, startling to
some, that the average graduate student in psychology in two hours
was capable of producing personality scales of equal reliability
and validity to those developed by far more expensive and time-con-
suming External strategies, they called for additional studies to ascer-
tain the generality of their findings to other samples of item writers,
targeted traits, and subject populations. In the present study the in-
vestigator has sought to fill this requirement and, indeed, has demon-
strated that undergraduate majors in psychology can produce scales
at least as high in validity as those prepared by the Ashton-Goldberg
graduate students. The validity coefficients in the present study are
even more striking when one considers that the reliability of the room-
mate ratings was attenuated by virtue of the use of only a single judge,
albeit one well acquainted with the subject, rather than the larger
number employed by Ashton and Goldberg. Findings such as those
reported in the present study and by Ashton and Goldberg can not but
hasten the demise of the unquestioned ascendance of the External
strategy of personality scale construction. In the past, rational or intui-
tive scales have been judged by some, including the writer, to be su-
perior to External scales on the logical grounds that they yielded less
ambiguous data, provided a more systematic sample of relevant behav-
ior, and offered less susceptibility to nuisance variables. Now, when
added to previous evidence (Hase and Goldberg, 1967) to the effect
that predictions based on linear combinations of externally-derived
scale scores were no more valid than were those based on linear com-
binations of intuitively-derived scale scores, new evidence exists that
relatively unsophisticated psychology students can generate per-
sonality items possessing higher validities than those derived from
empirically selected items based on an external criterion. With this kind
of evidence, there can be little or no justification for using the External
strategy as the sole method of personality scale construction. Perhaps 
the only defense for using such a strategy would be in situations where 
one is in ignorance about the nature of the criterion. Even here, 
however, the present author would argue for a conceptual analysis of 
the criterion as an alternative to relatively blind empirical methods for 
discovering its components. Perhaps a further direction of research 
might be to examine the alternative benefits of a conceptual analysis of 
incompletely understood criteria as against the use of the External 
strategy in an exploratory context. 
IMPROVING THE VALIDITY OF AFFECTIVE
SELF-REPORT MEASURES THROUGH CONSTRUCTING
PERSONALITY SCALES UNCONFOUNDED WITH
SOCIAL DESIRABILITY: A STUDY OF THE
PERSONALITY RESEARCH FORM¹


ROBERT D. ABBOTT 
California State University, Fullerton 


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 371-377. 


Jackson's Personality Research Form (PRF) was investigated at
the item and scale level with respect to the degree to which responses
are confounded by social desirability. Jackson's usage of the differen-
tial validity index resulted in large proportions of items neutral in so-
cial desirability and scales relatively balanced in the number of items
keyed for socially desirable and socially undesirable responses. These
item and scale characteristics, the low correlations of PRF trait
scales with social desirability scale scores, and the results of a compo-
nent analysis of PRF trait scales and social desirability scales sup-
ported the discriminant construct validity of the PRF trait scales
with respect to social desirability.


JACKSON (1967) has recently introduced the Personality Research
Form (PRF). Forms AA and BB of the PRF are parallel forms
providing scores on 20 trait scales and on two stylistic scales. The 20
trait scales were based upon Murray’s needs and the two stylistic
scales were designed to provide measures of nonpurposive respond-
ing and social desirability. Jackson's inclusion of a desirability scale
in the PRF is but one indication of his efforts to reduce the confound-
ing of scores on PRF scales with the tendency to respond in a socially
desirable manner (Edwards, 1957, 1970). Column 1 of Table 1 shows
the distribution of the absolute values of the correlations of the PRF
trait scales with the PRF Desirability (Dy) scale. A comparison of
these values with the correlations between Social Desirability (SD)
scales and scales in other inventories such as the California Psycho-
logical Inventory and the Minnesota Multiphasic Personality Inven-
tory (MMPI) (Abbott, 1971; Edwards, 1970) has shown that the PRF
scales are much less confounded with social desirability than are
scales in other inventories. Jackson (1967, 1970) reported achieving
this goal by using items with high "content" saturation with Jackson
and Messick's (Jackson, 1970) differential validity index furnishing a
quantitative measure of content saturation.

¹ A version of a paper presented to the Western Psychological Association, Los
Angeles, California, April, 1970.

The purpose of the present paper was to present a series of analyses 
of PRF items and scales to determine the apparent effects of the use 
of the differential validity index on more traditional item psycho- 
metric indices which have been used to reduce the confounding of trait 
scales and social desirability. Such information could be of potential 
value to research workers in personality test development who wish to 
employ a methodology based upon the differential validity index for 
enhancing the construct validity of a scale by minimizing response set 
confounding. 


Method 


Following directions reproduced in Edwards (1970), Group 1, con- 
sisting of 100 students, rated the Social Desirability Scale Values 
(SDSV) of the 440 items in Form AA of the PRF. The mean SDSV 
was obtained for each item. 

As part of an independent test research project, Group 2 (109 males 
and 109 females) followed self-description instructions and responded 
to the items in Form AA of the PRF, to items from the Edwards 
(1957) MMPI Social Desirability (SD) scale, and to items in Welsh's 
Repression (R) scale. 


Results and Discussion 
Item Level 


Edwards (1957) proposed that one way to reduce the effects of social 
desirability on responses to personality scales would be to use items 
which have SDSVs in the middle or neutral range of the SDSV con- 
tinuum. For large, relatively unselected, groups of personality items 
and constructs, the distribution of SDSVs has been shown (Cruse,
1965; Edwards, 1966) to be bimodal with the modes falling some-
where around 3 and 7 on the 9-point SDSV continuum. Figure 1 shows
the distribution of the SDSVs of the 400 PRF trait items, and the dis-
tributions of SDSVs reported by Edwards (1966) and Cruse (1965).

[Figure: proportion of items (vertical axis) plotted against SDSV (horizontal axis, 2.0 to 9.0) for three item pools: PRF (400 items), Edwards (2826 items), and Cruse (1647 items).]
Figure 1. Distributions of SDSVs from three item pools.


Figure 1 clearly shows that the proportion of items in the PRF with
neutral SDSVs is larger than the proportion of items with neutral
SDSVs in the Cruse (1965) list of personal constructs or in the
Edwards (1966) list of items from which he constructed his Edwards
Personality Inventory (Edwards, 1967). The usage of the differential
validity index by Jackson has resulted in items which are relatively
neutral in social desirability. Given the generalizability (Edwards,
1970) of SDSV mean ratings, the PRF items would probably be rated
in much the same way by other groups.


Scale Level 
Columns 2 and 3 of Table 1 show the frequency distribution of the
absolute correlations of the 20 PRF trait scales with the Edwards SD


TABLE 1
Frequency Distribution of Absolute Correlations of the 20 PRF Trait Scales with
Three Desirability Scales

              r_Dy    r_SD    r_BSD
.50-.59        0       0       0
.40-.49        0       1       0
.30-.39        4       4       0
.20-.29        5       5       7
.10-.19        7       3       6
.00-.09        4       7       7
(Edwards, 1957) scale and the Balanced True-False (BSD) Social De-
sirability scale that Edwards and Abbott (Edwards, 1970) constructed
from the MMPI. In no case did any of the social desirability scales ac-
count for more than 20% of the variance in a PRF trait scale. These
results strongly replicate those of Jackson (1967) with his Dy scale.
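
The link between a correlation and the variance it accounts for is simply the square of the coefficient. The short Python sketch below works through that arithmetic for the 20% figure; the helper names are illustrative only.

```python
from math import sqrt

def variance_accounted_for(r):
    """Proportion of variance in one scale associated with another: r squared."""
    return r * r

def largest_r_for(proportion):
    """Largest absolute correlation consistent with a given shared-variance ceiling."""
    return sqrt(proportion)

ceiling = largest_r_for(0.20)          # roughly .447
shared = variance_accounted_for(0.40)  # a correlation of .40 shares 16% of variance
```

Read this way, a 20% ceiling corresponds to absolute correlations below about .45.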

To investigate further the relationship between SD scale score and 
scores on the PRF scales, scores on the 22 PRF scales and Edwards 
SD scale were intercorrelated, factor analyzed by the principal compo- 
nent method, and the six factors with eigenvalues greater than one 
were rotated by Kaiser's varimax method. Table 2 presents all rotated 
normalized loadings greater than |.40| and the communality (h²) of
each of the scales. It is seen that Dy, Jackson's measure of desirability,
and SD, Edwards' measure, both marked factor V, which accounted
for less than 25% of any of the 20 PRF trait scale's common variance. 
This evidence not only shows that Jackson has succeeded in his at- 
tempt to minimize the importance of social desirability in the PRF but 
also strongly replicates research reported in his manual. 
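
The analytic steps named above (principal components of the scale intercorrelations, retention of components with eigenvalues greater than one, normalized varimax rotation) can be sketched as follows. This is a hedged illustration, not the program used in the study: it assumes NumPy is available, the rotation follows a common SVD-based varimax formulation, and the synthetic six-variable data set and all function names are hypothetical.

```python
import numpy as np

def principal_component_loadings(corr):
    """Unrotated loadings for components whose eigenvalue exceeds 1."""
    eigvals, eigvecs = np.linalg.eigh(corr)       # symmetric matrix, ascending order
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1.0
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

def varimax(loadings, max_iter=200, tol=1e-10):
    """Normalized varimax rotation of a loading matrix."""
    p, k = loadings.shape
    h = np.sqrt((loadings ** 2).sum(axis=1))      # square roots of communalities
    h[h == 0] = 1.0
    a = loadings / h[:, None]                     # row (Kaiser) normalization
    rotation = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        b = a @ rotation
        target = b ** 3 - b * (b ** 2).sum(axis=0) / p
        u, s, vt = np.linalg.svd(a.T @ target)
        rotation = u @ vt                         # orthogonal by construction
        if s.sum() - previous < tol:
            break
        previous = s.sum()
    return (a @ rotation) * h[:, None]            # undo the normalization

def communalities(loadings):
    """Sum of squared loadings for each variable (h squared)."""
    return (loadings ** 2).sum(axis=1)

# Synthetic data: six variables built from two independent factors plus noise.
rng = np.random.default_rng(0)
factors = rng.normal(size=(500, 2))
weights = np.array([[.8, 0], [.7, 0], [.8, 0], [0, .8], [0, .7], [0, .8]])
scores = factors @ weights.T + 0.5 * rng.normal(size=(500, 6))
corr = np.corrcoef(scores, rowvar=False)
unrotated = principal_component_loadings(corr)
rotated = varimax(unrotated)
```

Because the rotation is orthogonal, each variable's communality is the same before and after rotation; only the distribution of loadings across factors changes.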

Earlier studies by Edwards and his co-workers (Edwards, Diers, 
and Walker, 1962; Edwards and Walsh, 1963) have shown that several 


TABLE 2 
Rotated Normalized Factor Loading Matrix of the PRF and Marker Scales 

Scales I II III IV V VI
Ab 43 —76 
Ас 91 
Af 90 -4l 
Ag 86 50 
Au 47 -80 
Ch 86 50 
Cs ES 
De 81 48 
Do 81 —43 
En 87 
Ex 

86 1 
Ha —80 i 
Im 95 6 
Ми 83 6 
Or -95 в 
PI 53 2! 49 43 5l 
Se 50 76 6 
Sr ы 60 59 D) 
Su 86 60 
Un 91 01 
In А 96 n 
Dy —89 D 
SD —98 


Note. Only loadings greater in magnitude than .40 are shown. Decimals have been omitted.
scale psychometric indices derived from the item SDSVs are predic-
tive of the correlation of a scale with the SD scale and are thus helpful 
in predicting and interpreting the degree to which trait scale scores are 
confounded with the tendency to respond in a socially desirable direc- 
tion. These indices have included the imbalance (IMB) in the Social 
Desirability-Social Undesirability (SD-SUD) keying, to be defined 
shortly, and the proportion of items in a scale with neutral SDSVs, 
P(N). For MMPI scales, as IMB increases, the magnitude of the 
correlation with the SD scale increases, and as P(N) increases, the 
correlation with the SD scale decreases. 

For each PRF scale, Table 3 shows the distribution of SDSVs of the
items in each scale, the correlation of the PRF scale with the SD scale,
the imbalance in the PRF scale’s SD-SUD keying computed by sub-
tracting .5 from the proportion of items in a scale keyed for a socially
desirable response, as well as the proportion of items in a scale with
SDSVs between 4 and 6 on the 9-point SDSV continuum, P(N). These
item characteristics provided much information about the correlation
of trait scale scores with scores on the SD scale. P(N) was correlated
-.460 with the correlation of a scale with the SD scale. Thus as the
proportion of neutral items in a PRF scale increased, the correlation 
with the SD scale decreased. The degree of imbalance in SD-SUD key- 
ing correlated .48 with the correlation of a scale with the SD scale. 
Thus as the imbalance increased, the correlation of the scale with the 
SD scale increased. These findings, which replicate those with the 
MMPI, extend the usefulness of these item characteristics to items of 
nonpathological content taken from trait scales that have been de- 
signed to measure individual differences in “normal” personality 
traits. 
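
The two scale-level indices defined above can be computed directly from item SDSVs and keying. The Python sketch below is illustrative only. It assumes that a keyed response counts as socially desirable when the item is keyed True with an SDSV above the scale midpoint of 5, or keyed False with an SDSV below it, and that the neutral band of 4 to 6 is inclusive; neither rule is spelled out in the text, and the function names and toy items are hypothetical.

```python
def keyed_desirable(sdsv, keyed_true, midpoint=5.0):
    """Assumed rule: endorsing a desirable item or denying an undesirable one."""
    return (keyed_true and sdsv > midpoint) or (not keyed_true and sdsv < midpoint)

def imbalance(items):
    """IMB: proportion of items keyed for a socially desirable response, minus .5."""
    proportion = sum(keyed_desirable(s, k) for s, k in items) / len(items)
    return proportion - 0.5

def proportion_neutral(items, low=4.0, high=6.0):
    """P(N): proportion of items with SDSVs in the neutral band."""
    return sum(low <= s <= high for s, _ in items) / len(items)

# Toy scale of four items given as (mean SDSV, keyed True?).
toy_scale = [(7.2, True), (3.1, False), (5.0, True), (2.5, True)]
imb = imbalance(toy_scale)
p_n = proportion_neutral(toy_scale)
```

For this toy scale two of the four items are keyed in the desirable direction, so IMB is zero, and one item falls in the neutral band, so P(N) is .25.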

Jackson’s use of the differential validity index and of item selection
techniques has resulted in scales consisting of a greater proportion of
neutral items, as well as in scales which, in general, have a balance in
their SD-SUD keying. This study has indicated that PRF scales that
do have correlations with the SD scale are those with smaller propor-
tions of neutral items or with imbalances in their SD-SUD keying, e.g.,
the Dy scale. However, the relatively small confounding of variance in
the PRF trait scales with the SD scale has lent support to the dis-
criminant validity of the PRF trait scales.
Two measures of motive to succeed were revised for administra-
tion to female ninth grade students (N = 71). Scores on Hermans’
(1970) Prestatie Motivatie Test (PMT) yielded a high degree of inter-
nal consistency, comparable to that obtained with males, which was
greater than that found for scores on the present version of Mehra-
bian’s (1969) Resultant Achievement Motivation (RAM) Test. In
separate validation analyses, scores on the PMT were observed to
correlate positively and substantially with each of two measures of
school achievement and with questionnaire data on school-related
attitudes and behavior. Although the correlations of the RAM scores
with achievement measures were in the same direction, they were
weaker than those for the PMT scores. The relationships between the
two measures of motive to succeed and various internal causal
ascriptions were different and low for the two instruments.


In order to apply achievement motivation theory to educational
[...], an objective measure of motive to succeed (Ms) is needed
which can be conveniently administered to groups of adolescents.
[...] important, the instrument must possess reasonable and
[...] reliability and validity for both males and females. A
[...] weakness of current projective tests and, in particular, of
[...] objective measures that are intended to represent the same
characteristics are their lack of reliability and their absence of [...]
correlations with each other (Weinstein, 1969). One implication of
these findings is that objective tests may measure different aspects of
the same construct.

The authors gratefully acknowledge the assistance of Charles Clock, [...]
Francis Whittle of the West Hartford School District for their coopera-
tion [...]

Copyright © 1975 by Frederic Kuder

Some success has been reported with newer instruments not 
evaluated by Weinstein (1969). One is Mehrabian's (1968, 1969) Resul- 
tant Achievement Motivation (RAM) Test which has male (RAMm) 
and female (RAMf) subscales. Several studies present evidence for the
validity of the subscales (Cohen, Reid, and Boothroyd, 1973; Farley 
and Mealiea, 1973; Mehrabian, 1968, 1969; Raffini and Rosemier, 
1972; Reid and Cohen, 1973, 1974; Weiner and Potepan, 1970). 
Although Mehrabian has considered the RAM to be a resultant 
measure of the Ms minus the motive to avoid failure (Maf), there has 
been some speculation that it may measure the Ms alone (Weiner and 
Potepan, 1970). In the present study both propositions are considered. 
A second promising Ms instrument is Hermans' (1970) Prestatie 
Motivatie Test (PMT) which though not designed for use with females 
has been shown to compare favorably with the RAMm subscale when 
male subjects were employed (Schultz and Pomerantz, 1974). 

The purpose of this study was to alter existing instruments so that a 
combined form of the RAMm and RAMf could be administered 
simultaneously to a mixed male and female population and so that 
both instruments could be given to a younger and more heterogeneous 
group of subjects than that represented by college students. The instru-
ments were analyzed for reliability; validity was assessed by cor-
relating scores on each of them with academic performance, with a
measure of locus of control, and with scores derived from items on a
school attitude and behavior questionnaire reflecting variables such as
educational aspirations and frequency of doing homework.


Method 


One hundred and sixty-four subjects from two suburban junior high 
schools were randomly drawn from a large pool of ninth grade stu- 
dents and tested. Of these, 71 females were included in the following 
analyses. Two subjects who had not completed both test batteries Were 
dropped from the analyses.

Two batteries of tests were administered as a part of another project 
reported in more detail elsewhere (Schultz and Pomerantz, 197 
Several instruments were modified to improve their readability and to
make them more relevant to a younger age group and to a school en-
vironment. These included the following: Mehrabian's (1969) female

1 It should be noted that 12 items in the RAMf were altered to change either the age
or sex orientation so that they could be administered to both sexes simultaneously.

Revised versions of the RAM, PMT, and DAS are available from Charles B. Scl
Department of Education, Trinity College, Hartford, Connecticut 06106.
scale of the Resultant Achievement Motivation Test (RAMf), which
was administered with the male scale but analyzed separately,
Hermans’ (1970) Prestatie Motivatie Test (PMT), and the Debilitating
Anxiety Subscale (DAS) of the Achievement Anxiety Scale (Alpert
and Haber, 1960). The Intellectual Achievement Responsibility (IAR)
Questionnaire (Crandall, Katkovsky, and Crandall, 1965) was
employed in its original form as a measure of locus of control. This in-
strument was divided into the following four subscales of internal
causal ascriptions: Success to ability, success to effort, failure to lack
of ability, and failure to lack of effort. The subjects also responded to
a questionnaire of Likert scale items related to school attitudes and
behavior which reflect success striving, persistence, and achievement
behavior. (See Table 3.) The Comprehensive Test of Basic Skills
(CTBS) served as one index of school achievement.
Test batteries were administered in two sessions to groups of ap-
proximately 25 subjects. The RAMf, DAS, and the school attitude and
behavior questionnaire were completed during the first testing period
and the PMT and IAR were completed approximately two weeks later
in a second session. All the items except the school attitude and
behavior questionnaire were presented via 35 mm. slides projected
onto a screen. The subjects read each item while simultaneously hear-
ing a tape-recorded reading of the item which was synchronized to the
slide projector. The CTBS was administered approximately two
months earlier by school personnel independently of this investigation.
For each subject, three resultant achievement motivation scores
were obtained. The first was the RAMf score which Mehrabian (1969)
described as a resultant index of Ms-Maf. Since Weiner and Potepan
(1970) suggested that the RAM may measure Ms alone, a second
resultant was computed by subtracting the DAS z-score for each sub-
ject from her RAMf z-score; this difference score was labelled Mehra-
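
The difference score described above, a standardized motive score minus a standardized anxiety score, can be sketched in Python as follows. This is a minimal illustration that assumes standardization within the sample using the sample standard deviation; the function names and the three-subject toy data are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values):
    """Standardize raw scores within the sample."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def resultant_scores(motive_scores, anxiety_scores):
    """Motive z-score minus anxiety z-score, computed subject by subject."""
    return [zm - za for zm, za in zip(z_scores(motive_scores), z_scores(anxiety_scores))]

# Toy data for three subjects.
ramf_raw = [10, 20, 30]
das_raw = [3, 2, 1]
resultant = resultant_scores(ramf_raw, das_raw)
```

A subject who is one standard deviation above the mean on the motive measure and one below on anxiety receives a resultant of 2, so the two measures contribute equally to the composite.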


Results and Their Interpretation 


Both Ms measures were analyzed for their internal consistency 
(Cronbach, 1951). This analysis yielded an alpha coefficient of .59 for


[...] (Weiner and Potepan, 1970) divided his adult version of the IAR into
ability and effort or motivation. Crandall categorized all items other than
those which reflect effort as simply undifferentiated, since these items may refer to more
than ability (personal communication). Crandall's subscales were employed in the pres-
ent analyses with the labels used by Weiner and Potepan for the sake of consistency.
the RAMf and .84 for the PMT. The greater consistency index for the
PMT as compared with that for the RAMf approximates the coeffi-
cient of .82 obtained by Hermans (1970). These results are comparable
to other findings with males in which the alpha coefficient for the 
RAMm was .55 and the alpha coefficient for the PMT was .91 (Schultz 
and Pomerantz, 1974). 
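
For readers who wish to reproduce an internal consistency analysis of this kind, coefficient alpha can be computed from an item-by-subject score matrix as sketched below. The sketch assumes one row per subject and one column per item; the function name and the toy data are hypothetical and unrelated to the RAMf or PMT item pools.

```python
from statistics import variance

def cronbach_alpha(scores):
    """Coefficient alpha for a list of subjects, each a list of item scores."""
    k = len(scores[0])
    # Variance of each item across subjects.
    item_variances = [variance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total score across subjects.
    total_variance = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)

# Toy data: three subjects, three items that rank the subjects identically.
toy_scores = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
alpha = cronbach_alpha(toy_scores)
```

When every item orders the subjects in the same way, alpha reaches 1; coefficients such as .59 and .84 indicate progressively less item-specific variance relative to the variance of the total score.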

Resultant achievement motivation is presumed to be positively 
related to measures of academic achievement (Hermans, 1970). Ac- 
cordingly, the RAMf, RAMf-DAS, and PMT-DAS scores were cor- 
related with each of two measures of school achievement, teachers’ 
grades (GPA), and scores on a standardized test (CTBS). (See Table 
1.) 

Scores on the PMT exhibited a substantial relationship to academic
achievement for females and compared favorably with similar correla- 
tions for males (Schultz and Pomerantz, 1974). When the resultant 
PMT-DAS score was correlated with achievement, the relationship 
was somewhat stronger than that for the PMT alone. Both 
relationships would appear to support the use of the PMT as an index 
of Ms. In contrast, the relatively low correlations between scores on 
the RAMf and school achievement would suggest that the RAMf has 
mild predictive validity whether conceived of as a measure of Ms or of 
resultant achievement motivation. Standing on the DAS was related in 
a predictably negative manner to level of achievement. However, a
correlation no higher than that for the DAS was obtained from a com-
bination of the two measures in the RAMf-DAS resultant. Since the
RAMf alone was not related to academic achievement and in com-
bination with the DAS did not add to what was obtained from use of
the DAS alone, the RAMf would appear to lack validity as a measure 
of Ms or of Ms-Maf. Furthermore, since the PMT-DAS and the 


TABLE 1
Correlations of Achievement Motivation Measures with Indices of School Achievement
(N = 69)

Measures of Achievement      Indices of School Achievement
Motivation                   GPA            CTBS
PMT                          [illegible]    .50***
PMT-DAS                      .60***         .61***
RAMf                         [illegible]    .24*
RAMf-DAS                     [illegible]    .55***
DAS                          [illegible]    -.52***

* p < .05.
** p < .01.
*** p < .001.
RAMf variables were nonsignificantly related (r = .08) they,
therefore, could not be considered equivalent resultant indices. 

Measures of achievement motivation have been validated by 
relating them to causal ascriptions for success and failure (Cohen et
al., 1973; Weiner and Potepan, 1970). Achievement needs are
presumed to be positively related to the attribution of success to ability 
and to effort as well as to the attribution of failure to lack of effort. 
Achievement needs are also presumed to be inversely related to the at- 
tribution of failure to lack of ability (Weiner and Potepan, 1970). 
Table 2 summarizes these relationships in the present study. Few of 
the predicted correlations were obtained. The PMT and the PMT-
DAS variables were positively related to a significant degree (p < .05)
to the attribution of success to ability, and the inverse relationship
between the RAMf-DAS variable and the attribution of failure to
lack of ability was statistically significant (p < .05). 

The case for the validity of the PMT with the female adolescents of
the present study rested largely on its correlation with attribution of
success to ability. In this respect, the findings were consistent with
both theoretical expectations and the most frequent previous findings.
The case for the validity of the RAMf rests largely on its negative
correlation with the attribution of failure to lack of ability. This
relationship replicated earlier findings with undergraduate males and
females (Weiner and Potepan, 1970) and with undergraduate males
(Cohen et al., 1973). However, Cohen et al. (1973) failed to obtain a
similar relationship with undergraduate females, and Schultz and
Pomerantz (1974) failed to find it with adolescent males. Neither
measure of Ms was correlated positively with the attribution of failure
to lack of effort. To the knowledge of the writers, this relationship has
not been reported in any of the other studies in which measures of Ms
were correlated with the subscales of the IAR when slightly different
populations had been used. The present findings may reflect difficulties
with the achievement motivation or attribution models as much as
with the invalidity of the measuring devices.


TABLE 2
Correlations of Achievement Motivation Measures
with Internal Causal Ascriptions (N = 69)

Internal                      Measures of Achievement Motivation
Causal Ascriptions            PMT      PMT-DAS   RAMf     RAMf-DAS
Success: Ability              †        †         †        .16
Success: Effort               †        †         †        .09
Internal for Success          .26*     †         .10      .16
Failure: Lack of Ability      .08      -.04      -.19     -.25*
Failure: Lack of Effort       .03      .13       -.04     .10
Internal for Failure          .07      .06       -.14     -.07

*p < .05.
† Entry illegible in the source copy.

The validity of measures of achievement motivation has been asses- 
sed by relating them to questionnaires measuring achievement-related 
attitudes and behaviors (Mehrabian, 1969). Table 3 provides informa- 
tion on self-reports of academic achievement attitudes and behaviors. 
Most of the correlations which were significant were in the predicted, 
positive direction. Students scoring high on the various achievement 
motivation measures, as compared with those who earned low scores, 
tended to report higher educational aspirations, to place a greater 
value on grades, to indicate that their grades reflected their knowledge, 
and to report having higher grades and doing more homework. 
Among the measures of achievement motivation, the correlations were 
clearly highest when Hermans’ PMT was used as a measure of Ms. 
Generally the RAMf-DAS correlations were stronger than were those 
for the RAMf alone. 

Mehrabian originally designed the RAMf as a resultant measure of
Ms-Maf. This instrument as currently modified for use with adolescent
females was nonsignificantly related to the DAS (r = .03). If the
RAMf was functioning as a resultant, it should be inversely related to
debilitating anxiety because that factor is by definition a component of
the resultant. Furthermore, the RAMf was only weakly related to the
CTBS and was not significantly related to grades or to internal causal
ascriptions. Alternatively, the RAMf might be considered a measure
of the Ms alone as suggested by the low correlation with the measure
of test anxiety. Across significant and nonsignificant results, there was
a weak trend for the RAMf-DAS, rather than for the RAMf alone, to
relate more strongly with questionnaire items and internal causal
ascriptions. Regardless of what the RAMf is considered to be, it did
possess less validity than did the PMT for use with female adolescents
as revised for this experiment.


TABLE 3
Correlations of Achievement Motivation Measures with Selected Items
from the School Attitude and Behavior Questionnaire (N = 66a)

Self-Reported School          Measures of Achievement Motivation
Attitudes and Behaviors       PMT      PMT-DAS   RAMf     RAMf-DAS
Educational Aspirations       .33**    .22       .29*     †
Importance of Grades          .51***   .43***    .07      †
Extent to Which Grades
  Reflect Knowledge           .47†     .39†      .11      †
Usual Grades                  .51***   .56***    .09      †
Amount of Homework Done       .57***   .46***    -.02     †
Frequency of Doing
  Homework                    .43***   .33**     .08      †

a Three additional students were omitted from this analysis because
they did not complete all items on the questionnaire.
*p < .05.
**p < .01.
***p < .001.
† Entry wholly or partly illegible in the source copy.

The literature on achievement motivation has made it clear that in- 
struments which attain a measure of success in assessing achievement 
needs for males have not typically worked for females. This difference 
may have been due to the different acculturation that males and 
females experience. One effect of different acculturations may be to 
render other motivations such as fear of success more important for 
females than for males. Therefore, measures of achievement needs 
may have met with difficulty. According to this line of reasoning, such 
instruments are used to measure something which may exist in small 
quantities, if at all. However, it is also possible that, although both
males and females may acquire achievement needs, these needs are
expressed differently because of different socialization processes and
therefore must be measured differently.

Some of the more recent studies suggest that the achievement 
motivation construct may be applicable to both sexes when measured 
by the Thematic Apperception Test (Simons and Bibb, 1974;
Ollendick, 1974). Alper (1974) has speculated that because males may have
shown less interest in being achievers and females may have
recognized that achievement is an appropriate trait for the female role,
previous findings of male and female differences in achievement
motivation may have been mitigated. Although the present findings
did not bear directly on these attitude changes, they were consistent
with Alper's (1974) suggestions. They appeared to extend the recent
trend of obtaining predicted effects with females on projective devices
by yielding similar outcomes on an objective instrument.

This result in particular was the case for the PMT which has
emerged from the present analyses as a relatively reliable and valid
objective measure of Ms, at least with a school-age female population.
Moreover, to the extent that the validity of the PMT is sustained, it
appears that male and female achievement needs can be measured in the
same way.
The purpose of this study was to investigate the utility of the
Edwards need achievement scale (n-Ach) for predicting achievement
performance, as (a) a supplement to academic aptitude tests, and (b) a
predictor of over- and under-achievement. Subjects were 217 college
students enrolled in five sections of a general introductory
psychology course. A correlational analysis was carried out among the
following measures: Edwards n-Ach score, American College
Testing Program Examination (ACT) score, overall grade point
average (GPA), psychology course grade, and derived measures of
over- and under-achievement.

From the results of the study, the following conclusions were 
drawn: 

a. Little support for the use of the Edwards n-Ach scale as a
supplement to ability test scores in the prediction of
academic performance was offered.

b. The n-Ach scale was of little value in differentiating between
over- and under-achievers.

c. Further investigation is needed to evaluate a single course
grade as an alternative to overall GPA as a suitable criterion
of academic achievement.


THE applied psychologist is continually faced with the issue of the
prediction of scholastic success in academic institutions. Aptitude
and intelligence tests have proved quite useful for this prediction.
However, predictions based on these measures are not perfect. In
fact, they account for less than half of the variance in academic
performance.

Requests for reprints should be sent to Ronald B. Morgan, Loyola University of
Chicago, 820 North Michigan Avenue, Chicago, Illinois 60611.
Recent research interest in personality variables and in other
non-intellectual factors has pointed to these elements as additional sources
of variance in the prediction of academic achievement. The Edwards 
Personal Preference Schedule (EPPS) measures personality factors or 
“needs,” such as, Achievement and Endurance, which are logically 
related to academic performance. Therefore, the EPPS would appear 
to be useful in measuring personality variables in their relationship to 
academic success (Edwards, 1959). 

The practical value of a personality measure in the prediction of 
academic performance depends upon its ability to account for a por- 
tion of the criterion variance not predicted by an academic ability test. 
For example, Weiss, Wertheimer, and Groesbeck (1959) added n-Ach 
scores to academic aptitude test scores in a multiple regression equa- 
tion. With a sample of 49 undergraduate psychology students they 
found the coefficient of correlation with overall grade point average 
(GPA) was increased from .55 to .64. In a different approach to the 
same problem, Goodstein and Heilbrun (1962) obtained a .24
coefficient of correlation between n-Ach scores and GPA after correlations
due to difference in academic aptitude were partialed out (N = 206).
Gebhart and Hoyt (1958) found n-Ach scores to be higher for
"over-achievers" than for "under-achievers." "Over-achievers" were defined
as students whose grades were substantially higher than were those
predicted by academic aptitude test scores; the opposite condition
defined "under-achiever." In a similar study, Krug (1959) confirmed
the findings.

The just described results are relevant to the construct validity of the 
Edwards n-Ach scale. Scores on the n-Ach scale should be positively 
correlated with academic achievement, when academic aptitude is held 
constant. A valid measure of n-Ach would be expected to have a
significant degree of correlation with the amount of over-achievement.
Achievement motivation as reflected in n-Ach scores, apart from
intellectual ability or measured academic aptitude, would be anticipated to
influence academic performance in a positive manner.


Purposes 
The two major purposes of this study were to provide additional
evidence concerning the efficiency of the Edwards n-Ach measure as a
supplement to standard tests of academic aptitude in predicting
academic achievement and to discriminate between over- and
under-achievers. This study was essentially a replication of an earlier study
done by Bachman (1964). Bachman found no increment in prediction
of GPA when n-Ach scores were added to scores on a scholastic
aptitude test in a multiple regression equation and little success in
predicting over- and under-achievement from n-Ach scores.

A subsidiary purpose was to examine an alternative to overall GPA
as the criterion for studies of academic achievement. The development
of adequate criteria is an endless problem in the study of academic 
achievement. The most frequently used and most often criticized 
criterion is the student's grade point average, taken over one or more 
semesters. The level of difficulty in different courses, different stan- 
dards applied by each instructor in his method of evaluation, and 
variation in courses under different instructors all lead to the incor- 
poration of unwanted variance in the criterion. 

These difficulties might be avoided by using grades assigned by one 
teacher in a single course. This study employed such an approach 
through using introductory psychology unit examination scores as the 
alternative criterion of academic achievement. 

It was hypothesized that 

a. The Edwards need achievement scale (n-Ach) would be a
useful supplement to academic aptitude test scores. The
multiple correlation (involving a weighted combination of the ACT
and n-Ach variables) with GPA would be greater than the
zero-order correlation between scores in the ACT and GPA
earned.

b. The Edwards n-Ach scale would discriminate between over-
and under-achievers. A valid measure of n-Ach would be
expected to have a significant level of correlation with the degree
of over-achievement.

c. As a criterion of achievement, a variable of obtained grades in
the general psychology course would yield higher correlations
with individual predictors and composites of predictors than
would overall grade point average (GPA).


Method 


Subjects 


College students enrolled in five introductory psychology
sections provided data for the study. The total number of students
enrolled in these sections was 290. Thirty-four of these had not taken
the ACT and 28 either failed to take the EPPS or withdrew from class.
Thus, 217 students (135 males, 82 females) made up the total
number of subjects used in the study.


Measures


In addition to the previously cited measures of n-Ach from the
EPPS, the ACT, overall GPA, and introductory psychology unit
examination scores (based on mean T scores for three unit examinations
in General Psychology) (Psy.), a measure of over-achievement (AI_GPA)
was developed by subtracting predicted GPA from obtained GPA.
The predictions employed the ACT regression equation from the ACT
Research Service Report (ACT, 1965). Thus, AI_GPA = obtained GPA
- predicted GPA. The regression equation used was as follows:


Predicted GPA = 0.376 + 0.046 ACT_ENG + 0.025 ACT_MATH
                + 0.028 ACT_SOC.S - 0.011 ACT_N.SCI


A similar measure of over-achievement (AI_Psy.) was developed for
psychology course performance. This entailed calculation of a
regression equation for predicting psychology test performance from the
weighted combination of ACT battery scores. The resulting equation
was as follows:


Predicted Psy. (mean T score) = 31.66 + 0.0031 ACT_ENG + 0.3930 ACT_MATH
                                + 0.6143 ACT_SOC.S - 0.1528 ACT_N.SCI
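
The achievement index described above can be sketched in a few lines. This is only an illustration under stated assumptions: the coefficients are those printed for the GPA equation, while the dictionary keys, the function names, and the example student are hypothetical and do not come from the study.

```python
# Coefficients as printed in the text for predicting GPA from the four
# ACT subtest scores. The key names are hypothetical labels.
GPA_INTERCEPT = 0.376
GPA_WEIGHTS = {"eng": 0.046, "math": 0.025, "soc_s": 0.028, "n_sci": -0.011}


def predicted_gpa(act):
    """Return the GPA predicted from a dict of ACT subtest scores."""
    return GPA_INTERCEPT + sum(GPA_WEIGHTS[k] * act[k] for k in GPA_WEIGHTS)


def achievement_index(obtained_gpa, act):
    """AI = obtained GPA minus predicted GPA (positive = over-achievement)."""
    return obtained_gpa - predicted_gpa(act)


# Hypothetical student (not from the study): all four subtests equal to 20.
student = {"eng": 20, "math": 20, "soc_s": 20, "n_sci": 20}
print(round(predicted_gpa(student), 3))           # 2.136
print(round(achievement_index(2.5, student), 3))  # 0.364
```

The psychology-course index would follow the same pattern with the second set of coefficients and the mean T score as the obtained value.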


Statistical Analysis 


Product-moment coefficients of correlation were computed among 
the measures just listed. In addition, ACT and n-Ach scores were
combined in multiple correlations with each of the two criteria of academic
performance (GPA and Psy.). Partial r's were computed for the correlation
of AI_GPA with the ACT composite, with n-Ach removed, and for the
correlation of AI_Psy. with the ACT composite, with n-Ach removed.
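
For readers who want the partialing step spelled out, the standard first-order partial correlation formula can be sketched as follows. The function name and the example values are illustrative assumptions and are not taken from the study.

```python
import math


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Illustrative values only (not from the study).
print(round(partial_r(0.50, 0.30, 0.40), 3))  # 0.435
```

When the third variable is uncorrelated with both of the others, the partial correlation equals the zero-order correlation.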


Results 
Prediction of Academic Performance 


In Table 1 the correlations among predictors and criteria of
academic performance are summarized. A negative correlation was
found in all cases between n-Ach and the criteria of academic
performance. None of the correlations reached the .05 level of significance.

The addition of the n-Ach score to the ACT composite score in a
multiple regression equation served to reduce the accuracy of
prediction of both criteria. The decrease in correlation was not significant at
the .05 level.
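
As a rough check on the size of the combined prediction, the multiple correlation for two predictors can be computed directly from the three zero-order correlations. The sketch below uses the textbook two-predictor formula with the values printed in Table 1 and its note; the function and variable names are hypothetical, and the authors' own computation is not described in the text.

```python
import math


def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of y with predictors 1 and 2, from zero-order r's."""
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return math.sqrt(r_squared)


# Zero-order correlations as printed in Table 1 and its note.
r_gpa_act, r_gpa_nach, r_act_nach = 0.377, -0.023, -0.078
R = multiple_r_two_predictors(r_gpa_act, r_gpa_nach, r_act_nach)
print(round(R, 3), round(R ** 2, 3))  # 0.377 0.142
```

With these inputs the formula gives an R of about .377 for GPA, essentially the zero-order ACT correlation, which is in line with the conclusion that n-Ach adds nothing useful to prediction.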

The correlations of ACT composite and GPA with and without
n-Ach held constant were, respectively, .377 and .388. There was no
significant difference between the two correlation coefficients.

Since the respective achievement indexes AI_GPA and AI_Psy. remove
the variance accounted for by differences in aptitude, negligible r's of
[illegible] and -.050 were obtained between each of the achievement
indexes and the ACT Composite. The results indicated no prediction of
achievement from the ACT Composite.

The problem of possible overlap of aptitude and n-Ach was
investigated by calculating partial r's between the ACT Composite and
each of the two AI indexes with n-Ach held constant. Neither the
correlation of .048 involving the AI_GPA nor the one of .053 for the AI_Psy.
was statistically significant at the .05 level.


TABLE 1
Coefficients of Correlation between Predictors and
Criteria of Academic Performance
(N = 217)

                       Criterion Variables
Predictors             GPA        Psy.
n-Ach                  -.023      -.028
ACT                    .377       .422
ACT, n-Ach (R)         .142       .178

In all cases a two-tailed test of significance was used.
The correlation between ACT and n-Ach was -.078.


Differentiation of Over- and Under-Achievement


In view of the negative findings for n-Ach, it was decided to evaluate
all EPPS scales as possible predictors of differences in achievement.
The coefficients of correlation between the AI_GPA and each of the 15
scales of the EPPS varied from -.138 to .147, and the coefficients
between AI_Psy. and each of the 15 scales ranged from -.120 to .159.
Only three of the coefficients reached the .05 level of significance.
Correlations of .147 and .159 between the Intraception Scale and AI_GPA
and between the Intraception Scale and AI_Psy., respectively, were
statistically significant beyond the .05 level as was the correlation of
[illegible] between the Dominance Scale and AI_GPA.


Table 1 indicates that in every instance the use of psychology grades
as the criterion resulted in higher coefficients of correlation than those
obtained using GPA. Relative to EPPS subscales, the coefficients of
correlation obtained using psychology grades as the criteria were not
significantly higher than were those obtained using GPA.
Discussion


The results of this study offered little support for the use of the 
Edwards n-Ach scales as a supplementary predictor of academic 
achievement. Among several studies examined but not cited only the 
one by Weiss, Wertheimer, and Groesbeck (1959) presented evidence 
that the use of the n-Ach scale improved the prediction of academic 
achievement. It should be noted that their data were based on only 49 
subjects, whereas the present sample included 217, and that different 
academic aptitude test scores were used in the two studies. The results 
are consistent with Bachman’s (1964) findings, with the exception that 
nonsignificant and negative correlations were found in all cases 
between n-Ach and the criteria of academic performance. It should be 
noted that Bachman’s sample included only 61 subjects, while the 
present sample included 217 subjects. Bachman used SAT composite 
scores as a measure of academic aptitude, whereas in the present 
study, ACT composite and subtest scores were used as measures of 
academic aptitude. Thus, it would seem reasonable to conclude that 
the Edwards n-Ach scale is not a useful supplement to ability test 
scores in the prediction of academic performance and is of little value 
in differentiating between over- and under-achievers. 
VALIDITY OF THE MMPI-168 FOR PSYCHIATRIC
SCREENING


JOHN E. OVERALL
University of Texas Medical Branch, Galveston 


JAMES N. BUTCHER 
University of Minnesota, Minneapolis 


SARA HUNTER 
University of North Carolina, Chapel Hill 


Validity of an abbreviated 168-item administration and the
standard MMPI was compared with reference to discriminating
psychiatric patients from normal college students. Better discrimination
was obtained from clinical scale scoring than from factor scoring.
The abbreviated MMPI-168 actually produced slightly better
discrimination than did the longer parent instrument. Revised
equations for converting MMPI-168 scores to conventional MMPI
validity and clinical scale scores are presented.


THE purpose of this study was to compare the discriminant validity,
for general psychiatric screening, of an abbreviated 168-item
administration of the Minnesota Multiphasic Personality Inventory
(MMPI) with that of the standard 373-item short form. A secondary
purpose of this article was to provide new and improved equations for
converting scores derived from an abbreviated administration to
conventional MMPI clinical scale scores.

Overall and Gomez-Mont (1974) have presented evidence that much
of the reliable variance of the standard MMPI clinical scales is
concentrated in the first 168 items. A procedure for estimating
conventional clinical scale scores from raw scores obtained by applying
standard MMPI scoring stencils to the first 168 items was described.
Overall, Hunter, and Butcher (1973) investigated the factor structure 
of the first 168 items in an effort to understand further the content of 
the abbreviated version. Factors representing Somatization, Depres- 
sion, Low Morale, Psychotic Distortion, Acting Out, plus an Mf 
Feminine Interests factor were identified, and items constituting factor 
scoring keys were reported. Analyses of the larger MMPI item pool
were undertaken by Hunter, Overall, and Butcher (1974) to verify that 
the structure was not appreciably different from that of the first 168 
items. As a result of these investigations, factor scoring procedures 
purported to represent the same basic dimensions of psychopathology 
were made available for the 373-item MMPI short form and for the 
168-item abbreviated short form. Thus, in the present investigation, it 
was possible to compare the discriminant validity of the abbreviated 
and standard administrations both in terms of clinical scale scoring 
and factor scoring. 

Because the writers have envisioned an important use of the 
MMPI to be psychiatric screening in ostensibly normal populations, 
the ability of the instrument to discriminate psychiatric patients from 
normal college students seems a reasonable basis for comparison of 
the validity of different scoring procedures. Although the psychiatric
sample considered in this study was not matched to the college sample
in age or social class, it seems appropriate to assume that the primary
source of difference in MMPI clinical scale score and factor score
profiles should be the degree of psychopathology.


Method 


A mixed clinical sample consisting of 431 subjects including neurotics,
psychotics, personality disorders, alcoholics, and drug abusers was
obtained from a state hospital, from diagnostic referrals of [illegible]
patients in a university hospital, from an inpatient alcohol [illegible]
unit, and from an outpatient drug rehabilitation unit. Males
outnumbered females approximately 3 to 1 in the psychiatric sample. A
normal comparison group was obtained by randomly sampling, in
approximately the same sex ratio, 400 MMPI records from a large
college student sample obtained in group testing at Bethel College
(Minnesota).

The MMPI item response protocols were computer scored according
to four procedures: (a) standard clinical scale scoring, (b) clinical
scale scoring based on the first 168 items, (c) factor scoring based on
the 373-item short form, and (d) factor scoring based on the first 168
items. For each type of scoring, means and standard deviations were
calculated for each variable from the normal college sample alone, and
all scores were transformed to T-scores based on the college norms.

Assuming that the psychiatric sample should differ from the college 
sample by evidencing higher levels of psychopathology, the first 
analyses involved a simple determination of the number of subjects in 
each group who had one or more scores elevated above T-score of 70 
according to each of the four scoring procedures. Because of the mixed 
sex composition of the samples, Mf scales or factor scores were not 
considered in these analyses. 
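
The norming and elevation-counting steps can be illustrated with a small sketch. It assumes the usual T-score convention (mean 50, standard deviation 10 in the norm group) and uses the sample standard deviation; the text does not say which form of the standard deviation was used, and the function names and example numbers are hypothetical.

```python
from statistics import mean, stdev


def t_scores(raw, norm_sample):
    """Convert raw scores to T-scores (mean 50, SD 10) using a norm sample."""
    m, s = mean(norm_sample), stdev(norm_sample)
    return [50 + 10 * (x - m) / s for x in raw]


def any_elevated(profile_t, cutoff=70):
    """True if one or more scale T-scores exceed the cutoff."""
    return any(t > cutoff for t in profile_t)


# Hypothetical norm-group raw scores for one scale (not the study's data).
norm = [10, 12, 14, 16, 18]
print([round(t, 1) for t in t_scores([14, 22], norm)])  # [50.0, 75.3]
print(any_elevated([55, 71]), any_elevated([70, 60]))   # True False
```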

Next, the method of linear discriminant function analysis was used
to determine the degree of separation of the normal and psychiatric
samples. Cutting points were selected to minimize classification errors
in the sample data. To evaluate possible shrinkage, the discriminant
function analyses were repeated with the samples split at random into
primary and cross-validation samples. The discriminant function
derived from the primary samples was used to assign subjects in the
cross-validation samples.
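
The procedure just described (fit a discriminant function, pick the cutting point with the fewest errors, then check shrinkage in a held-out half) can be sketched as below. This is a minimal illustration, not the authors' program: it uses only two scales so the matrix algebra stays in closed form, the data are synthetic, and every name and parameter value is an assumption.

```python
import random
from statistics import mean


def composite(w, p):
    """Weighted composite (discriminant score) for one two-scale profile."""
    return w[0] * p[0] + w[1] * p[1]


def n_errors(w, cut, normal, patient):
    """Misclassifications when a composite >= cut is called 'patient'."""
    return (sum(composite(w, p) >= cut for p in normal)
            + sum(composite(w, p) < cut for p in patient))


def fit_discriminant(normal, patient):
    """Fisher discriminant weights and the error-minimizing cutting point."""
    sxx = sxy = syy = 0.0
    centers = []
    for group in (normal, patient):
        mx, my = mean(p[0] for p in group), mean(p[1] for p in group)
        centers.append((mx, my))
        for x, y in group:
            sxx += (x - mx) ** 2
            sxy += (x - mx) * (y - my)
            syy += (y - my) ** 2
    dx = centers[1][0] - centers[0][0]
    dy = centers[1][1] - centers[0][1]
    # w = S^-1 (mean_patient - mean_normal); the common divisor (df) of the
    # pooled covariance only rescales w, so raw sums of squares are used.
    det = sxx * syy - sxy ** 2
    w = ((syy * dx - sxy * dy) / det, (sxx * dy - sxy * dx) / det)
    candidates = [composite(w, p) for p in normal + patient]
    cut = min(candidates, key=lambda c: n_errors(w, c, normal, patient))
    return w, cut


random.seed(0)
# Synthetic two-scale T-score profiles (assumed values, not the study's data).
normal = [(random.gauss(50, 10), random.gauss(50, 10)) for _ in range(400)]
patient = [(random.gauss(65, 12), random.gauss(62, 12)) for _ in range(430)]
random.shuffle(normal)
random.shuffle(patient)

# Sample A fits the function; Sample B is held out for cross-validation.
a_norm, b_norm = normal[:200], normal[200:]
a_pat, b_pat = patient[:215], patient[215:]
w, cut = fit_discriminant(a_norm, a_pat)
err_a = n_errors(w, cut, a_norm, a_pat) / (len(a_norm) + len(a_pat))
err_b = n_errors(w, cut, b_norm, b_pat) / (len(b_norm) + len(b_pat))
print(f"Sample A error rate: {err_a:.3f}; Sample B error rate: {err_b:.3f}")
```

The difference between the two error rates is the shrinkage; because the cutting point is tuned on Sample A, the Sample B rate is the more honest estimate.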


Results 


The frequencies of classification based on observation of one or
more elevated scores (T > 70 in college population) are presented in
Table 1 for factor scoring and clinical scale scoring based on 168 items
and on 373 items. Because the T-scores were normed on the college
population, the use of T > 70 as a cutting point resulted in fewer
misclassifications of normals than of psychiatric patients. Also, because
there were fewer factor scores than clinical scale scores on which
elevations could occur, the proportion classified as abnormal according to
factor scoring was lower than for clinical scale scoring. These
differences, however, should not affect the comparisons of primary
interest in this investigation. The comparability of classification based
on the 168-item abbreviated short form with that obtained from the
standard 373-item short form is impressive. Probably as a matter of
chance in these particular data, the abbreviated MMPI-168 scoring
actually yielded fewer errors of classification than did the MMPI-373
with both factor scoring and clinical scale scoring. The validity of
factor scoring and clinical scale scoring appeared comparable also, when
simple T > 70 scale elevation was used as the criterion.
The contingency coefficients relating MMPI classification to
group membership were essentially identical (.46, .47, .45 and .45) for
the results shown in Table 1.
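
The contingency coefficients quoted above can be reproduced from the Table 1 frequencies. The sketch assumes the coefficient is Pearson's C computed from an uncorrected chi-square, which the text does not state explicitly; the function name and table labels are hypothetical.

```python
import math


def contingency_coefficient(table):
    """Pearson's C = sqrt(chi2 / (chi2 + N)) for a 2 x 2 frequency table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return math.sqrt(chi2 / (chi2 + n))


# Frequencies from Table 1: rows = classified normal / abnormal,
# columns = college sample / clinical sample.
tables = {
    "clinical scales, 373 items": [[332, 136], [68, 295]],
    "clinical scales, 168 items": [[331, 130], [69, 301]],
    "factor scores, 373 items": [[349, 167], [51, 264]],
    "factor scores, 168 items": [[351, 167], [49, 264]],
}
for label, t in tables.items():
    print(label, round(contingency_coefficient(t), 2))
```

Under that assumption the four values come out at about .46, .47, .45, and .45, in the order clinical 373, clinical 168, factor 373, factor 168.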
TABLE 1
Frequencies of Correct and Incorrect Classification Based on One or More
Elevated Scales from Four Different Scoring Procedures

Clinical Scale Scoring 373 Items
                College Sample     Clinical Sample
Normal          332                136
Abnormal        68                 295

Clinical Scale Scoring 168 Items
                College Sample     Clinical Sample
Normal          331                130
Abnormal        69                 301

Factor Scoring 373 Items
                College Sample     Clinical Sample
Normal          349                167
Abnormal        51                 264

Factor Scoring 168 Items
                College Sample     Clinical Sample
Normal          351                167
Abnormal        49                 264


Next, a simple discriminant function analysis was performed on
each of the sets of scores.* The computer program that was used
produced a percentile frequency distribution of scores on each
optimally weighted composite (discriminant function) for the two
groups. A cutting point was selected by inspection to minimize
classification errors in the sample data. Results are presented in Table 2 for
the four types of scoring.
Contrary to the results obtained from classification based on single
scale elevation (Table 1), the discriminant function approach yielded
considerably better results when applied to clinical scale scores than to
factor scores. It is also apparent that the discriminant function
approach applied to clinical scale scores did yield considerably fewer
classification errors than did the criterion of single scale elevation
applied to the same type of scoring. Of primary concern in this
investigation, the discriminant function classification based on the 168-item
abbreviated short form was as accurate as the classification based on
standard 373-item short form scoring.

* The specific discriminant function coefficients are not reported here because the
optimum combination of clinical scale scores depends on the psychiatric sample to be
discriminated. This study was addressed to the question of which type of scoring could be
expected to yield best discrimination, not the definition of a particular weighted
composite for general use.

To confirm that the superior results obtained for clinical scale
scoring were not due to capitalizing on chance relationships among the
larger set of clinical scale scores, as compared with factor scores, the
samples were subsequently split at random into primary and
cross-validation samples. Discriminant functions were fitted in Sample A for
both factor scoring and clinical scale scoring, and then the accuracy of
classification was assessed in Sample B. The shrinkage observed in the
cross-validation samples was approximately 2 per cent. Because the
cross-validation results were so similar to those cited in Table 2, their
presentation in separate tables seems unnecessary.

TABLE 2
Frequencies of Correct and Incorrect Classification Based on Discriminant
Functions from Four Different Scoring Procedures

Clinical Scale Scoring 373 Items
                College Sample     Clinical Sample
Normal          †                  57
Abnormal        †                  374

Clinical Scale Scoring 168 Items
                College Sample     Clinical Sample
Normal          †                  †
Abnormal        †                  †

Factor Scoring 373 Items
                College Sample     Clinical Sample
Normal          †                  118
Abnormal        †                  313

Factor Scoring 168 Items
                College Sample     Clinical Sample
Normal          †                  †
Abnormal        †                  †

† Entry illegible in the source copy.


Having verified high discriminant validity for clinical scale scores
estimated from an abbreviated 168-item administration of the MMPI
through use of regression transformations previously derived by
Overall and Gomez-Mont (1974) from a relatively small sample of
psychiatric patients, the writers calculated new and more
parsimonious regression transformations from the larger combined
psychiatric and college samples of this investigation. To facilitate clinical
use, the new regression transformations were calculated to go from
each individual MMPI-168 raw score to the estimated corresponding
MMPI raw clinical scale score without additional scales being
included in the equations. The variances of the regression
transformations were also adjusted to equal the variances of the standard MMPI
raw scale scores in the composite sample under consideration.
Without such adjustment, the variances of least squares regression
estimates are smaller than the variances of the criterion scores. The
"stretching" of variances has no effect on the linear correlation
between predicted and observed scores, but it is important if one is
going to use existing T-score norms for interpretation of the clinical scale
scores derived from an abbreviated administration. The regression
transformations for MMPI-168 raw scores (obtained by applying
standard MMPI scoring stencils to the first 168 items) are presented in
Table 3. The product moment correlations between the MMPI-168
and MMPI-373 clinical scale scores observed in this sample are shown
in the right-hand column.
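
The variance adjustment described above can be illustrated with a short sketch. It assumes the adjustment amounts to using the ratio of standard deviations as the slope, in place of the least squares slope r times that ratio, which is one common way to make the estimates match the criterion variance; the paper does not spell out its exact computation. The data are synthetic and all names are hypothetical.

```python
import random
from statistics import mean, stdev


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den


def variance_matched_line(x, y):
    """Slope and intercept of a line whose estimates have the SD of y."""
    slope = stdev(y) / stdev(x)  # least squares would use r * sd_y / sd_x
    if pearson_r(x, y) < 0:
        slope = -slope
    return slope, mean(y) - slope * mean(x)


random.seed(1)
# Synthetic short-form and full-form raw scores (not the study's data).
short = [random.gauss(10, 3) for _ in range(300)]
full = [1.4 * s + 5 + random.gauss(0, 2) for s in short]

slope, intercept = variance_matched_line(short, full)
est = [slope * s + intercept for s in short]
print(round(stdev(est) / stdev(full), 3))  # 1.0
print(abs(pearson_r(est, full) - pearson_r(short, full)) < 1e-9)  # True
```

The estimates keep the same correlation with the criterion as the short-form scores themselves, while their spread now matches the full-form scores, which is the property needed for existing T-score norms to apply.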


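TABLE 3
Regression Estimation of Raw Scale Scores from 168-Item MMPI

(Left side: estimated standard MMPI raw score. Right side: raw score
from the first 168 items. The column of correlations is illegible in
the source copy.)

L  = 1.29 L  + 0.31
F  = 1.76 F  + 1.63
K  = 1.90 K  + 2.21
Hs = 1.39 Hs + 0.67
D  = 1.26 D  + 4.60
Hy = 1.37 Hy + 6.86
Pd = 1.37 Pd + 5.66
Mf = 1.78 Mf + 5.17
Pa = 2.15 Pa + 1.50
Pt = 2.39 Pt + 0.82
Sc = 3.52 Sc + 3.25
Ma = 1.78 Ma + 1.44
Si = 3.78 Si + 1.40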
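
Applying the conversions is a matter of one multiplication and one addition per scale. The sketch below is illustrative only: the slope and constant pairs are a reading of the printed table (the scale labels for L, F, and Pd are inferred from the standard scale order), rounding to the nearest whole raw score is an assumption, and the function name and example scores are hypothetical.

```python
# (slope, constant) pairs as read from the printed table.
CONVERSIONS = {
    "L": (1.29, 0.31), "F": (1.76, 1.63), "K": (1.90, 2.21),
    "Hs": (1.39, 0.67), "D": (1.26, 4.60), "Hy": (1.37, 6.86),
    "Pd": (1.37, 5.66), "Mf": (1.78, 5.17), "Pa": (2.15, 1.50),
    "Pt": (2.39, 0.82), "Sc": (3.52, 3.25), "Ma": (1.78, 1.44),
    "Si": (3.78, 1.40),
}


def estimate_full_scale(raw_168):
    """Map MMPI-168 raw scores to estimated standard MMPI raw scores."""
    return {scale: round(CONVERSIONS[scale][0] * value + CONVERSIONS[scale][1])
            for scale, value in raw_168.items()}


# Hypothetical raw scores from the first 168 items.
print(estimate_full_scale({"D": 12, "Sc": 5}))  # {'D': 20, 'Sc': 21}
```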
Discussion


Answers to two principal questions were sought in this
investigation: Is the psychiatric screening utility of an abbreviated 168-item
administration of the MMPI approximately equivalent to that of the
longer conventional MMPI? Is the psychiatric screening utility of
factor scoring superior to that of more familiar clinical scale scoring? The
answers appear to be that the abbreviated MMPI-168 has the potential
for providing just as valid general psychiatric screening results as does
the considerably longer standard MMPI administration and that the
factor scoring examined here does not appear to offer potential
superior to clinical scale scoring for the purpose of general psychiatric
screening.

Much more work must be done to evaluate the clinical utility of a
shortened administration of the MMPI. At this point in time, evidence
appears adequate to confirm that most of the reliable variance in the
standard MMPI clinical scales is accounted for by the first 168 items
(Overall and Gomez-Mont, 1974), that the content domain spanned
by the first 168 items is equivalent (as judged by factor structures) to
that of the larger item pool (Overall, Hunter and Butcher, 1973;
Hunter, Overall and Butcher, 1974), and that scale scores derived from
the first 168 items correlate quite highly with clinical scale scores
derived from the larger item pool (Hedlund, Powell, and Cho, 1974;
Newmark, Newmark, and Cook, 1975; Newmark and Raft, 1975).
That the clinical scale scores derived from the abbreviated
administration have potential equivalent to those derived from a longer standard
administration for general psychiatric screening appears documented
in the present study.

In the further evaluation of the MMPI-168, the authors would
urge that the conventional longer MMPI can serve only as a
limited and imperfect criterion. Comparisons should consider the
longer and the abbreviated versions as alternative test instruments.
When results do not always agree, it cannot be assumed
that the longer MMPI is always correct. The use of external criteria,
as in the present investigation, appears to offer an advantage for
comparison of alternative forms without assuming that either is the
ultimate criterion.
COMPARISON OF THE STANDARD MMPI AND THE
MINI-MULT IN A UNIVERSITY COUNSELING CENTER


R. B. SIMONO 
University of North Carolina at Charlotte 


One hundred ten undergraduates were administered the standard 
Minnesota Multiphasic Personality Inventory which was scored for 
the standard profile and for a shortened version of the same inven- 
tory. The study was designed to explore the usefulness of a short ver-
sion of the MMPI in a university counseling center. Correlations
were obtained between corresponding scales on both forms for males
and females separately. Although Pearson product-moment correla- 
tions for both sexes were statistically significant, they were not of the 
magnitude to predict scale scores on one form from the other. In 
addition, an examination of the profiles suggested that the short ver- 
sion could not provide clinical data comparable to those of the stan- 
dard form. Implications for further research were made. 


In college or university counseling centers, the standard MMPI is
often used as a diagnostic and research instrument. However, psychol-


when asked to respond to the standard MMPI.

Kincannon (1968) took an abbreviated form of the MMPI (which
he titled the “Mini-Mult”) consisting of 71 items and compared it to
the standard MMPI with a sample of 50 male and 50 female
admissions to a psychiatric unit. Included were MMPI scales 1 to 4,
6 to 9, L, F, and K. For the two sets of raw scores on the Mini-Mult
and the MMPI, product-moment correlations ranged from .80 to .93,
with the median correlation being .87. A second comparison, with a
similar group of 25 males and 25 females, resulted in correlations
which ranged from .70 to .96 with a median of .87.
Lacks (1970) replicated the findings of Kincannon with 50 males
and 44 females who were in-patients in an acute intensive treatment
facility. Correlations between the comparable scales of the MMPI and
the Mini-Mult ranged from .68 to .89 with a median of .83. In a study
by Harford, Lubetkin and Alpert (1972) correlations between corresponding
scales on the Mini-Mult and the MMPI ranged from .21
to .81 with a median r of .54. When relating the scales on the MMPI to
those of the Mini-Mult, Armentrout (1970) obtained statistically significant
correlations in all scale comparisons except for males on the F
scale. For males r’s ranged from .09 to .73 and for females from .42 to
.85. In this study a comparison of profile peaks led to the conclusion
that “Mini-Mult profiles did not permit prediction of the one or two
most elevated clinical scales” on the MMPI.

The present study was designed to explore the usefulness of a short 
version of the MMPI in a university counseling center as well as to de- 
termine whether earlier results of investigations of the Mini-Mult 
could be replicated with a sample of college males and females demon- 
strating no gross abnormality. 


Method 


The subjects for this study were 110 male and 45 female undergraduates
who were clients in a university counseling center which offered
general psychological services to a population of approximately 5,000
students. When counseling was initiated, the clients were administered
the standard MMPI. The MMPIs were scored for standard
profile (K-corrected) and Kincannon's scoring procedures were
used to obtain Mini-Mult scores on the same set of data.


Results 


Table 1 shows the product-moment correlations of the comparable
scales for both the standard MMPI and the Mini-Mult. The correlations
between the scales for the male sample yielded values ranging
from .03 to .79, with a median correlation of .60. Although 10 of the
correlations were significant (p < .01), only five scale correlations (K,
D, Pa, Pt, Sc) were high enough that one-half of the variance of one
scale was accounted for by that of the other scale.

For the sample of females, correlations between scales ranged from
.18 to .85 with a median correlation of .71. Although nine of the
correlations were significant (p < .01), only six of the scale correlations
(D, Pd, Pa, Pt, Sc, Ma) were such that one-half of the variance of one
scale was accounted for by that of the other scale.
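
The criterion used in the two paragraphs above, that one-half of the variance of one scale be accounted for by the other, corresponds to a squared correlation of at least .50, that is, a correlation of roughly .71 or higher in absolute value. The sketch below only illustrates that arithmetic with coefficients quoted in the text; the function names are hypothetical.

```python
import math

# A correlation r accounts for r**2 of the variance in the other scale.
# Half of the variance is reached when |r| >= sqrt(0.5), about .707.
HALF_VARIANCE_R = math.sqrt(0.5)


def shared_variance(r):
    """Proportion of variance shared by two scales correlated at r."""
    return r * r


def accounts_for_half(r):
    """True when the correlation explains at least one-half of the variance."""
    return shared_variance(r) >= 0.5


# Median correlations reported in the text for males (.60) and females (.71).
for label, r in [("male median", 0.60), ("female median", 0.71)]:
    print(label, round(shared_variance(r), 4), accounts_for_half(r))
```

On this reading the male median of .60 corresponds to 36 percent shared variance, which is why statistical significance alone did not satisfy the criterion.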


TABLE 1
Correlations of the Corresponding Scale Scores for the
Standard MMPI and the Mini-Mult


          Females (N = 45)    Males (N = 110)
Scale            r                   r
L              .59**               .39**
F              .54**               .21**
K              .63**               .71**
Hs             .33                 .03
D              .71**               .75**
Hy             .18                 .34**
Pd             .81**               .60**
Pa          [illegible]         [illegible]
Pt          [illegible]            .79**
Sc             .85**               .76**
Ma          [illegible]            .29**


** p < .01.


Because the MMPI is often used in a therapy setting to generate or
confirm clinical hypotheses, it seemed important to explore the degree
to which the Mini-Mult could predict various indices of personal/social
adaptation problems in a college population. The method
used, which was developed by Drake and Oetting (1959), was designed
to help develop hypotheses from MMPI patterns. No scale was
coded high unless it had a T-score of 55 or above. Only the three highest
scales were selected to represent the high coding of the profile.
After the three highest scales were chosen, the numbers of the scales
were arranged in numerical order regardless of the relative size of the
T-scores. This procedure was followed for both the standard MMPI
and the Mini-Mult in order to obtain a comparison of profile peaks on
the two comparable forms of this personality inventory.
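
The coding rule described above (a T-score of 55 or above, the three highest scales, scale numbers listed in numerical order) can be sketched as follows. This is an illustration under stated assumptions rather than the Drake and Oetting procedure itself: ties are broken in favor of the lower scale number, agreement between forms is summarized as the count of shared coded scales, and the T-scores in the usage note are invented.

```python
def high_point_code(t_scores, cutoff=55, max_scales=3):
    """Return the coded high points of one profile.

    t_scores maps a clinical scale number (e.g. 1, 2, 3, 4, 6, 7, 8, 9)
    to its T-score. Only scales at or above the cutoff are eligible, at
    most three are kept, and the result is listed in numerical order
    regardless of the relative size of the T-scores.
    """
    eligible = [(t, scale) for scale, t in t_scores.items() if t >= cutoff]
    # Highest T-score first; ties go to the lower scale number (assumption).
    eligible.sort(key=lambda pair: (-pair[0], pair[1]))
    top = [scale for _, scale in eligible[:max_scales]]
    return tuple(sorted(top))


def shared_high_points(code_a, code_b):
    """Number of scales coded high on both forms (0 to 3)."""
    return len(set(code_a) & set(code_b))
```

With invented T-scores, `high_point_code({1: 50, 2: 72, 3: 54, 4: 60, 6: 48, 7: 70, 8: 65, 9: 58})` gives `(2, 7, 8)`, and comparing it with a second code such as `(2, 4, 7)` through `shared_high_points` gives 2.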

After looking at the profile pairs for females according to the Drake
and Oetting procedures and noting only the highest three scales
regardless of relative size of T-scores, the investigator found that 15
individuals had three scales in the same order on both forms, seven had
two scales in the same order in each profile, and 14 had one scale
which showed up high in one profile and also in the other. Nine female
profile pairs showed no similarity in profile peaks through using the
Drake and Oetting procedures. The 110 male profile pairs were then
analyzed also through employing the Drake and Oetting procedures.
It was found that seven pairs had three scales in the same order on
each form, 45 pairs had two scales in the same order in each form and
35 had one identical scale in each form. Twenty-three pairs of
profiles showed no relationship to each other.
Discussion


The results of the present investigation of the comparability of the
standard Minnesota Multiphasic Personality Inventory and the Mini-Mult
were dissimilar from those found by Kincannon (1968) and
Lacks (1970) but similar to those obtained in the comparability
studies of Harford et al. (1972) and Armentrout (1970).

Although there were significant correlations found between the
scales of the MMPI and corresponding scales on the Mini-Mult, the
results reinforced doubts that the Mini-Mult could be used in a university
counseling center to provide clinical data comparable to those
available from the standard MMPI. Neither the present study nor
Armentrout’s (1970) work approached the level of comparability seen
in the work of Kincannon (1968) and Lacks (1970). Reinforced was
the suggestion made by Harford et al. (1972) that the comparability of
the two forms is enhanced when working with a “more pathologically
severe population.”

Although the Mini-Mult did not provide profile peaks equivalent to
those in the standard version of the MMPI, it did furnish personality 
data which met criteria used to indicate emotional tension or difficulty 
in personal adaptation as measured by the MMPI. Because measures 
such as the Mini-Mult offer brief methods of personality assessment, 
additional research as well as clinical observation might determine the 
validity of the Mini-Mult as a separate personality measure. 
THE FACTORIAL VALIDITY OF
THE PIERS-HARRIS CHILDREN'S SELF-CONCEPT 
SCALE FOR EACH OF THREE SAMPLES OF 
ELEMENTARY, JUNIOR HIGH, AND SENIOR HIGH 
SCHOOL STUDENTS IN A LARGE 
METROPOLITAN SCHOOL DISTRICT 


WILLIAM B. MICHAEL AND ROBERT A. SMITH 
For each of three samples of 299 elementary school pupils, 302
junior high school pupils, and 300 senior high school students in a
large metropolitan school district, factor analyses of the intercorrelations
of the responses to 80 items in The Piers-Harris Children's
Self-Concept Scale yielded three major dimensions that were essentially
invariant across the three samples: (a) physical appearance,
(b) socially unacceptable (bad) behavior, and (c) academic or school
status. Variance within a complex domain of emotionality was
differentiated among a number of factors such as anxiety, abasement,
self-contentment, and self-dissatisfaction that were not invariant
across samples. In both the junior high school and senior
high school samples identifiable factors of popularity and perceived
psychomotor coordination appeared. Implications for writing of
items to represent the constructs associated with self-concept are
discussed.


Among several self-concept measures available to researchers and
school personnel, The Piers-Harris Children's Self-Concept Scale
(Piers and Harris, 1969) was evaluated by Shreve (1973) to show the
greatest promise according to criteria posed in the Technical Standards
for Educational and Psychological Tests (French and Michael,
1966). In their manual Piers and Harris (1969) reported a factor analysis
of the scale of 80 yes-no items that had been administered to a sample
of 457 sixth grade pupils. Six factors were tentatively identified as
“undesirable or bad” behavior, intellectual and school status, physical
appearance and attributes, anxiety, popularity, and happiness and
satisfaction.

The two-fold purpose of this investigation was to determine for each 
of three samples of 299 elementary school pupils, 302 junior high 
school students, and 300 senior high school students in one of the na- 
tion's largest metropolitan school districts the factorial dimensions of 
The Piers-Harris Children's Self-Concept Scale and to ascertain 
whether the constructs associated with the six factors reported by Piers 
and Harris could be replicated. Determination of the factorial struc- 
ture of this scale could enhance its utility in the identification of the 
measurable constructs underlying the complex entity called self-con- 
cept, could provide an improved basis for use of the instrument in 
diagnostic and counseling activities, and could suggest possible re- 
visions in or additions to items in the scale. 


Methodology 


Intercorrelations (phi coefficients) of the responses to the 80 items 
were found for each of the three samples, and in each sample a
varimax factor rotation was undertaken of all principal components 
with eigenvalues in excess of unity (Dixon, 1969). 
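
The phi coefficient used for the item intercorrelations is the product-moment correlation computed on two dichotomous (yes-no) items. A minimal sketch follows, assuming responses are coded 1 for yes and 0 for no; the function names are hypothetical, and the principal components extraction and varimax rotation are not shown.

```python
import math


def phi_coefficient(x, y):
    """Phi coefficient between two yes/no items coded 1/0."""
    n11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    n10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    n01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    n00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    # Marginal totals of the 2 x 2 table.
    denom = math.sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))
    if denom == 0:
        return float("nan")  # an item answered identically by everyone
    return (n11 * n00 - n10 * n01) / denom


def phi_matrix(responses):
    """Item-by-item phi matrix; responses is a list of per-pupil 0/1 lists."""
    items = list(zip(*responses))  # one tuple of responses per item
    return [[phi_coefficient(a, b) for b in items] for a in items]
```

Applied to the 80 items of the scale, `phi_matrix` would yield the 80 by 80 matrix that serves as input to the factor analysis.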

In general, with one or two exceptions, a factor was identified and is 
cited in this paper whenever it yielded a loading of at least .60 on one 
item, of at least .50 on a second item, and of at least .30 on each of two 
or more other items. 
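
The identification rule stated above can be written as a simple check on the loadings of one rotated factor. The sketch assumes that the thresholds apply to absolute loadings (the paper reports negative loadings but does not say so explicitly) and ignores the "one or two exceptions" the authors mention; the function name is hypothetical.

```python
def meets_identification_rule(loadings):
    """Check the rule: at least .60 on one item, at least .50 on a second,
    and at least .30 on each of two or more other items."""
    sizes = sorted((abs(value) for value in loadings), reverse=True)
    if len(sizes) < 4:
        return False
    return (
        sizes[0] >= 0.60
        and sizes[1] >= 0.50
        and sizes[2] >= 0.30
        and sizes[3] >= 0.30
    )
```

For instance, loadings of .74, .72, .40, and .36 satisfy the rule, whereas .66, .45, .42, and .30 do not, because the second-highest loading falls below .50.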


Findings 


The results of the investigation are reported in terms of (a) the citation
of three major factorial dimensions that were essentially
replicated across the three samples studied in three of the key dimensions
reported by Piers (1969), (b) a description of the factorial structure
of the broad but central domain of emotionality, and (c) a brief
exposition of other factors of secondary significance or concern.


Factors More or Less Invariant across Samples 


Across each of the three educational levels the three following fac- 
tors appeared to be relatively invariant: physical appearance, socially 
unacceptable (bad) behavior, and academic competence reflecting 
school and intellectual status. For each of the groups representing a
different educational level, a single well defined factor regarding 
physical characteristics emerged. Relative to the construct revealing 
socially undesirable behavior only one factor appeared for the ele- 
mentary and junior high school groups, but for the senior high school 
sample two factors resulted—the second one reflecting to a large ex- 
tent home difficulties. In the instance of the academic dimension, 
however, one factor indicating mostly observable activities or behav- 
iors in the sets of items and a second factor involving sets of items re- 
vealing primarily passive or reported subjective feelings appeared at 
each of the three educational levels. The factor analytic results indicat- 
ing loadings on individual test items at least equal to .30 for the 
physical appearance dimension may be summarized for the ele- 
mentary school (ES), junior high school (JHS), and senior high school 
(SHS) samples as follows: 


Factor I—Physical Appearance                           I (ES)       I (JHS)      I (SHS)
54  I am good looking.                                 .75          .49          .75
60  I have a pleasant face.                            .71          .43          -
73  I have a good figure.                              .68          -            .40
41  I have nice hair.                                  .67          [illegible]  .66
69  I am popular with girls.                           .58          .41          .45
29  I have pretty eyes.                                .58          -            [illegible]
15  I am strong.                                       -            .58          -
57  I am popular with boys.                            -            .57          .52
21  I am good in my school work.                       .54          -            -
5   I am smart.                                        .50          -            .35
8   My looks bother me.                                -            [illegible]  -.46
49  My classmates in school think I have good ideas.   .40          [illegible]  [illegible]
33  My friends like my ideas.                          .39          .36          -
36  I am lucky.                                        -            .31          -
70  I am a good reader.                                .35          -            [illegible]
27  I am an important member of my class.              .35          -            -
52  I am cheerful.                                     .34          -            -
80  I am a good person.                                .34          [illegible]  [illegible]
9   When I grow up, I will be an important person.     [illegible]  [illegible]  [illegible]


For the dimension of socially unacceptable, or so-called bad, behavior,
which included the appearance of two factors at the senior
high school level, the following outcomes were observed for each of
the three school groups:


Factor II—Socially Unacceptable                        II           II           IIA          IIB
(“Bad”) Behavior                                       (ES)         (JHS)        (SHS)        (SHS)
22  I do many bad things.                              .69          .39          .54          -
34  I often get into trouble.                          .50          -            .65          -
48  I am often mean to other people.                   .64          .57          .49          -
56  I get into a lot of fights.                        .54          .66          [illegible]  -
12  I am well behaved in school.                       -.44         -            -.48         -
78  I think bad thoughts.                              .42          .45          .33          -
14  I cause trouble to my family.                      .36          .55          .33          -
68  I lose my temper easily.                           .37          -            -            [illegible]
13  It is usually my fault when something goes wrong.  -            -            .42          -
38  My parents expect too much of me.                  -            .41          -            -
25  I behave badly at home.                            .37          .36          -            .64
62  I am picked on at home.                            [illegible]  [illegible]  -            .67
59  My family is disappointed in me.                   [illegible]  .48          [illegible]  .57
44  I sleep well at night.                             [illegible]  -            -.36         [illegible]
61  When I try to make something, everything seems
    to go wrong.                                       -            [illegible]  [illegible]  .36


For what appeared to be primarily activity-oriented items associated
with academic competence, the factor analytic results relative to
each of the three samples at differing educational levels were as
follows:


Factor IIIA—Academic or School Status Embodying        IIIA         IIIA         IIIA
Many Activity-oriented Items                           (ES)         (JHS)        (SHS)
42  I often volunteer in school.                       .65          -            .58
36  I can give a good report in front of the class.    .53          .61          .66
49  My classmates in school think I have good ideas.   .53          .34          -
16  I have good ideas.                                 .37          -            -
33  My friends like my ideas.                          .34          -            -
7   I get nervous when the teacher calls on me.        -            -            -.42
12  I am well behaved in school.                       -            [illegible]  -
75  I am always dropping or breaking things.           -            [illegible]  -
63  I am a leader in games and sports.                 .31          -            -
36  I am lucky.                                        -            .34          -
66  I forget what I learn.                             -            -            -.38
6   I am shy.                                          [illegible]  [illegible]  -.34


In the instance of items that tended to reflect somewhat passive and
subjective states of feeling regarding school competence in the absence
of much overt activity, the factor analytic data relative to each of
the same three groups representing differing educational levels were as
follows:


Factor IIIB—Academic or School Status Reflecting       IIIB         IIIB         IIIB
Subjective Feelings Rather Than Overt Activities       (ES)         (JHS)        (SHS)
26  I am slow in finishing my school work.             [illegible]  [illegible]  [illegible]
70  I am a good reader.                                .59          -            .31
21  I am good in my school work.                       -            .42          [illegible]
5   I am smart.                                        -            .79          [illegible]
61  When I try to make something, everything seems
    to go wrong.                                       [illegible]  [illegible]  [illegible]
38  My parents expect too much of me.                  -.45         -            -
22  I do many bad things.                              [illegible]  -            [illegible]
66  I forget what I learn.


Factors Pertaining to Components of Emotionality 


Relative to the factors that arose from clusters of mainly behaviorally
stated items dealing with such components of emotionality as
anxiety, self-depreciation (abasement), self-actualization (self-satisfaction
or happiness), and status or power needs, the factorial picture
was not only somewhat ambiguous or ill-defined within each of the
three samples at varying school levels, but also quite variable from one
school level to another.

For the elementary school group, two rotated factors (I and V), the
first one being made up of more behaviorally stated items than was the
second one, appeared to be sufficiently well described to permit
respective interpretations of self-depreciation and of anxiety embodying
withdrawal and alienation tendencies (although the patterns of
three or four item loadings on essentially a residual factor not to be
cited also suggested the possible presence of alienation accompanied
by ineptness, hypochondriasis, and egocentrism):


Factor IVA (ES)—Self-Depreciation (Abasement)


1   My classmates make fun of me.                                  .70
75  I am always dropping or breaking things.                       .68
79  I cry easily.                                                  .67
61  When I try to make something, everything seems to go wrong.    .63
53  I am dumb about most things.                                   .47
59  My family is disappointed in me.                               .36
13  It is usually my fault when something goes wrong.              .34
11  I am unpopular.                                                [illegible]


Factor IVB (ES)—Anxiety Involving Withdrawing Behavior


28  I am nervous.                                                  .68
10  I get worried when we have tests in school.                    [illegible]
6   I am shy.                                                      .57
37  I worry a lot.                                                 .56
20  I give up easily.                                              [illegible]
43  I wish I were different.                                       .38
7   I get nervous when the teacher calls on me.                    [illegible]
50  I am unhappy.                                                  .30


At the junior high school level two separate identifiable abasement-oriented
factors (I and VII), which might have been anticipated to fuse
as one general factor, did emerge as did two contrasting factors (V and
XVIII) of self-contentment (happiness) and self-dissatisfaction reflecting
anxiety and alienation characteristics.


Factor IVA (JHS)—Abasement or Self-Depreciation


    I give up easily.                                              .66
    I am unpopular.                                                [illegible]
    I am often mean to other people.                               .50
    My looks bother me.                                            .45
    I am shy.                                                      .42
    I cry easily.                                                  .30


Factor IVB (JHS)—Self-Depreciation Involving Alienation and
Self-Pity


    When I try to make something, everything seems to go wrong.    .66
    My classmates make fun of me.                                  .63
    It is usually my fault when something goes wrong.              .43
    I hate school.                                                 .41
    I am clumsy.                                                   .38
    In games and sports, I watch instead of play.                  .37
    I am picked on at home.                                        .32


Factor IVC (JHS)—Self-Contentment (Happiness)


    I am cheerful.                                                 .67
    I am a good person.                                            .62
    I am a happy person.                                           .58
    I am well behaved in school.                                   .42
    I am obedient at home.                                         [illegible]
    I have a pleasant face.


Factor IVD (JHS)—Self-Dissatisfaction (Unhappiness) Involving
Anxiety and Instability


43  I wish I were different.                                       [illegible]
40  I feel left out of things.                                     [illegible]
39  I like being the way I am.                                     [illegible]
4   I am often sad.                                                [illegible]
51  I have many friends.                                           -.42
28  I am nervous.                                                  [illegible]
74  I am often afraid.                                             [illegible]
8   My looks bother me.                                            .32


Two other factors each consisting of three item variables suggested
two additional dimensions of alienation and self-centeredness,
although a dependable identification was not possible. These results
are not reported.

At the senior high school level only one clearly identifiable factor in 
the area of emotionality was found. Relative to self-dissatisfaction or 
unhappiness this factor was described in terms of the following items 
and their loadings: 


Factor IV (SHS)—Self-Dissatisfaction Reflecting Alienation and 
Anxiety 


58  People pick on me.                                             .71
40  I feel left out of things.                                     .51
1   My classmates make fun of me.                                  .44
6   I am shy.                                                      .42
7   I get nervous when the teacher calls on me.                    .39
50  I am unhappy.                                                  .35


It should be mentioned that there was for the sample of senior high
school students the strong suggestion of a status-power factor as indicated
by loadings on four items:


63  I am a leader in games and sports.                             .69
27  I am an important member in my class.                          .64
15  I am strong.
36  I am lucky.


This perception of one's having status and influence might reflect the
development of a need for power in the social order on the part of
adolescents in senior high school who have been acquiring an appreciation
of the importance of power in adolescent and adult
cultures.

Other Factors 


Although not identifiable for all three student groups, two interpretable
factors appeared: Popularity and Perceived Psychomotor Coordination
(skill). For Popularity the following results were obtained:


Factor V—Popularity                                    V (JHS)      V (SHS)
51  I have many friends.                               .65          .60
3   It is hard for me to make friends.                 -.44         [illegible]
69  I am popular with girls.                           .56          .32
28  I am nervous.                                      -            -.47
6   I am shy.                                          -.34         [illegible]
58  People pick on me.                                 [illegible]  [illegible]


The factor involving perception of psychomotor coordination was
relatively clearly defined by two items in particular as follows:


Factor VI—(Perceived) Psychomotor Coordination         VI (JHS)     VI (SHS)
19  I am good at making things with my hands.          .74          .79
23  I can draw well.                                   .72          .74
16  I have good ideas.                                 .40          .46
33  My friends like my ideas.                          .36          -


Discussion


The identification in each of the three samples differing in
educational level of the same three constructs of physical appearance,
socially unacceptable (“bad”) behavior, and academic or school competence
(which was portrayed by two factors) replicated the first three
constructs in the factor analytic investigation reported by Piers (1969).
In what the writers preferred to call a domain of emotionality the 
results were somewhat complex. At the elementary school level two 
factors of self-depreciation (abasement) and anxiety were evident, 
whereas at the junior high school level four factors of abasement or 
self-depreciation, self-depreciation (with overtones of alienation, self- 
pity, and masochism), self-contentment (happiness), and self-dissatis- 
faction (unhappiness) involving anxiety feelings and instability 
emerged. Yet, at the senior high school level only one readily identi- 
fiable factor of self-dissatisfaction (unhappiness) reflecting alienation 
and anxiety components appeared. Thus the factors of anxiety and 
happiness reported by Piers were only partially replicated, and in the 
instance of the junior high school sample two abasement factors, one 
factor reflecting unhappiness, and one factor suggesting a positive 
affect of happiness were seemingly required to cover the measurable 
variance in emotionality. For senior high school students the one 
dimension of negative affect that could be isolated appeared to cut 
across the two dimensions reported by Piers (1969). This dimension 
actively constituted a reflection (negative direction) of the happiness 
factor described by Piers and incorporated elements of the anxiety fac- 
tor. In the junior high school and in the senior high school, but not in 
the elementary school sample, the factor interpreted to be popularity 
was also replicated. In addition, the suggestion of a new factor involving
the perception of competence in psychomotor skills for the junior
high school and senior high school samples was apparent, and the hint 
of a status or power factor was evident for the sample of senior high 
school students. 

Although the conclusion could be made that several of the factorial 
dimensions identified by Piers were replicated in the samples studied, 
the complex domain of emotionality involving such constructs as hap- 
piness (self-satisfaction or self-actualization), unhappiness (self-dis- 
satisfaction or lack of self-actualization), and anxiety with overtones 
of abasement, self-depreciation, masochism, and guilt yielded results 
that were not too nearly congruent with the factor analytic findings 
cited by Piers who used essentially the same factor analytic pro- 
cedures as those employed by the writers. Even though 
methodological difficulties in use of principal components extraction 
and in the subsequent varimax rotation might account in part for the 
inability of the investigation to yield as psychologically meaningful 
results as might be possible, there was the strong suggestion that the 
items making up the domain of emotionality were open to a greater 
variety of interpretations and to a more subjective evaluation upon be- 
ing read than were those items associated with physical appearance, 
intellectual or academic status, and so-called bad or troublesome behaviors.
Thus it would appear that major efforts in item writing to
operationalize affective constructs that can be anchored to as comprehensive
and clear-cut a theoretical formulation of the nature of affective
behaviors in the self-concept as is feasible would enhance the factorial
validity as well as the utility of a revised form of The Piers-Harris Children’s
Self-Concept Scale.
THE CONTENT AND CONSTRUCT VALIDITY OF
THE BARTH SCALE: ASSUMPTIONS OF
OPEN EDUCATION


ANTHONY J. COLETTA 
William Paterson College of New Jersey 


ROBERT K. GABLE 
University of Connecticut 


Latent partition analysis was used to generate content categories 
from judgmental data gathered on Barth Scale items from 23 open 
education experts. Principal component analysis was employed to 
examine constructs derived from response data gathered from 78 
open and 113 traditional teachers. Alpha internal consistency 
reliabilities were developed for item clusters defining each dimension. 
Relationships between the judgmental categories and response 
dimensions facilitated naming the derived dimensions. Evidence of 
content and construct validity and of internal consistency reliability 
indicated appropriate scoring and interpretation for Barth Scale
items. 


ALTHOUGH an increasing number of school systems have recently 
adopted open education practices, the approach has been subjected to 
little systematic research. In particular need of study is the develop- 
ment of instrumentation for research in the area of open education. 
Bussis and Chittenden (1970) have cited the importance of the Barth 
Scale for describing how teachers view their role and the process of 
children's learning. The Barth Scale could prove useful to the school 
system as well as individual teachers for examining beliefs regarding
learning, knowledge, and evaluation during the formative stages of 
implementing open classrooms. 

The Barth Scale contains 28 Likert items generated by Roland S. 
Barth (1970, 1971) to examine the many written and verbal statements 
offered by open educators.* Essentially, he endeavored to make ex- 
plicit many of the assumptions which underlie the practices of open 
education. 

The primary purpose of this paper was to report an examination of 
the content and construct validity of the Barth Scale. Latent partition
analysis (Wiley, 1967; Gable and Pruzek, 1972) was employed to
generate item content categories for judgmental data gathered from 
open education experts; factor analysis was used to describe constructs 
generated from response data gathered from open and traditional 
elementary school teachers. A secondary purpose of this paper was to 
examine the relationships between the judgmental categories (content 
validity) and the response dimensions (construct validity). The results 
of these analyses will be presented separately. 


Method 


Study I: Content Validity. Latent partition analysis (LPA) was 
employed to study the item universe sampled by the Barth Scale items. 
The specific purposes of this content validity study included: (a) to ex- 
amine judgmental data for meaningful content categories which reflect 
judges’ sortings of items into mutually exclusive content piles, (b) to
explore the association between categories for the possible merging of
categories, and (c) to display resultant categories which were generated
from the judgmental data.


Sample 


The sample employed for this aspect of the study consisted of 44
open education experts identified from a group of open
education authorities listed in the study by Walberg and Thomas
(year illegible). Twenty-three judges responded; they included writers,
professors, practitioners, supervisors, and consultants, experienced in
teaching or observing in open classrooms.


of the Barth Scale 
th's Suggestion, items 10 and 22 were collapsed into one item in 


_ 
Procedure

The Barth Scale items were typed on individual slips of paper and 
mailed to the experts along with a brief letter of instructions asking the 
experts to sort the statements into mutually exclusive categories on the 
basis of similar item content. The judges’ categories, called manifest 
categories, were submitted to latent partition analysis. The LPA 
procedure results in a joint proportion matrix (of order 28 by 28) 
where each entry indexes the proportion of sorters who placed a given 
pair of items in the same manifest category. From this matrix, a latent 
category matrix is derived; entries within this matrix index for each 
item the derived (latent) category to which the item belongs (Gable 
and Pruzek, 1972; Wiley, 1967). 
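
The first step of the procedure described above, the joint proportion matrix, is straightforward to compute from the judges' sortings. The sketch below covers only that step (the derivation of the latent category matrix is not shown) and assumes each judge's sorting is given as a list of category labels, one per item; the function name and data layout are hypothetical.

```python
def joint_proportion_matrix(sortings):
    """Proportion of judges who placed each pair of items in the same pile.

    sortings holds one list per judge; position i of a judge's list is the
    label of the manifest category into which that judge sorted item i.
    Labels need not match across judges, since only co-membership counts.
    """
    n_judges = len(sortings)
    n_items = len(sortings[0])
    counts = [[0] * n_items for _ in range(n_items)]
    for piles in sortings:
        for i in range(n_items):
            for j in range(n_items):
                if piles[i] == piles[j]:
                    counts[i][j] += 1
    return [[count / n_judges for count in row] for row in counts]
```

With 28 items this yields the 28 by 28 matrix mentioned in the text; every diagonal entry equals 1.0 because an item always shares a pile with itself.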


Results and Discussion 
The latent category matrix is presented in Table 1. An examination 


TABLE 1
Derived Approximation to Latent Category Matrix for 28 Items with 7 Categories

Item                                       Category Number
Number  1            2            3            4            5            6            7
1       [illegible]  0            1            -8           -6           2            0
2       [illegible]  -1           2            7            -13          -6           -8
3       [illegible]  0            0            -15          0            9            6
6       48           3            9            18           27           0            4
18      2            43           -4           42           21           0            -20
20      -6           108          4            -15          1            0            9
21      -2           109          -5           3            [illegible]  0            -2
22a     15           [illegible]  [illegible]  6            10           -11          2
23      [illegible]  115          -4           -3           -4           0            2
24      1            108          -4           [illegible]  -10          5            6
25      -6           24           70           12           10           0            -14
26      [illegible]  [illegible]  104          -6           0            [illegible]  9
27      [illegible]  [illegible]  102          17           -2           7            [illegible]
28      4            0            108          -15          [illegible]  [illegible]  [illegible]
29      [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
13      [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]  [illegible]
16      [illegible]  -8           24           [illegible]  2            2            1
17      2            [illegible]  4            117          [illegible]  1            [illegible]
19      10           2            -4           59           18           [illegible]  [illegible]
4       -6           [illegible]  -2           19           [illegible]  [illegible]  [illegible]
5       50           1            -5           10           54           [illegible]  3
7       -30          -4           [illegible]  [illegible]  112          1            7
8       6            0            -5           0            106          [illegible]  0
9       13           [illegible]  [illegible]  -2           90           17           11
11      8            [illegible]  22           1            5            92           [illegible]
12      -5           4            0            0            -3           104          2
14      -6           2            3            5            2            4            89
15      1            4            1            0            0            -3           93


Note.—Rows were reordered to facilitate interpretations; all entries have been [illegible].
a Items 22 and 10 from the original Barth Scale reported in Phi Delta Kappan, October, 1971 were
combined into one item in this study.
of the entries in Table 1 indicates that seven latent categories were obtained.
Twenty-six items with high loadings on only one category were
selected for naming the categories. The following is a description of 
each category in terms of item content. 

Category 1 was called Exploration (EX), since the item content 
described the child's natural inclination to explore when learning. The 
second category was called Evaluation (EV) with each item describing 
the philosophy of assessment in the open classroom. Category 3 was 
titled Knowledge (K); all items in this cluster deal with the open 
educator's views concerning the role of knowledge. Categories 4 and 7 
were labeled Intellectual Development (ID), since the items referred to 
the process of cognitive development. Category 5 was called Choice 
(C); items defining this grouping suggested the importance of allow- 
ing children a choice in learning. It should be noted that item 5 contri- 
buted slightly to the naming of this category but also loaded 
moderately on Category 1: Exploration. Apparently the experts 
perceived the phrase “active exploration in a rich environment” as denoting
both exploration and choice. The name given to Category 6 was
Involvement (I) as both statements defining the category were con- 
cerned with the needs for children to share their learning experiences 
with others. 

Although the latent category matrix revealed several clearly defined 
subuniverses of item content for the judgmental data, two categories
reflected similar subuniverses. Examination of Table 2, containing the
indices of association between the pairs of latent categories, supported
the similarity of the item content in Categories 4 and 7. Thus, these
two item clusters labeled Intellectual Development were combined.

In summary, the LPA method provided an objective means of


TABLE 2
Indices of Association between Latent Categories


CATEGORY NUMBER 


1 2 T 
1 80 3 4 5 6 
2 5 70 
4 0 11 72 
5 22 11 12 64 

“© à 5 9 15 50 
7 20 д 4 9 25 80 
0 15 45 22 22 109 


Note. Entries in this matrix lie between zero and unity when the model fits the data. If the matrix is essentially
[...] from differential splitting of the same latent categories. Diagonal entries estimate the probability
that any two items in that category will, in a new partition, be sorted into the same manifest category;
off-diagonal entries estimate the probability that two items from two different latent categories will be
placed in the same manifest category. All decimal points have been omitted.
studying content validity by identifying subuniverses of item content.
The identification of such content categories based on judgmental data 
gathered from content experts contributes greatly to the conceptual 
understanding of the constructs generated from a factor analysis of 
item response data. The results of such a factor analysis of response 
data are presented in the next section. 


Method 


Study II: Construct Validity. Factor analysis was employed to ex- 
amine construct validity by identifying the underlying dimensions or 
constructs which explain item response interrelationships. 


Sample 


The sample for the construct validity study consisted of 191 elemen- 
tary school teachers from the eastern United States. Seventy-eight 
open and 113 traditional teachers from city, rural, and suburban 
schools participated from the following states or districts: Florida, 
Washington, D. C., Maryland, New Jersey, New York, Connecticut, 
Massachusetts, and New Hampshire. 


Procedure 


Since the items on the Barth Scale tend to be grouped by Barth on 
the basis of similar item content, the items were randomly reordered
before administering them to the teachers. For ease of interpretation, 
though, the original item numbers will be used in the tables presented 
in this section. 

Each teacher responded to 28 items on a 5-point Likert scale rang- 
ing from strongly agree to strongly disagree. A 28 by 28 item inter- 
correlation matrix was generated and submitted to a principal compo- 
nents analysis followed by an obliquimax transformation (Hofmann, 
1970). The derived dimensions or constructs described relationships 
between the Barth Scale items for actual response data.

In the section which follows, primary emphasis is placed on naming 
and interpreting the constructs, while attention is also given to the 
relationships between the final judgmental categories generated by the 
latent partition analysis and the response data constructs. 


Results and Discussion 


Using the standard root criterion of 1.0, eight components
were derived; seven of these components were defined by at least two
items with loadings above .35. The component loading matrix is
presented in Table 3.

Table 4 contains the factor names, original LPA item codes, Barth 
Scale item stems, and factor loadings. The naming of the derived fac- 
tors was facilitated greatly by considering the derived content cate- 
gories as noted by the LPA item codes. 

Factor I was called Curricular Flexibility. Items defining the dimen- 
sions reflect the questioning of the existence of a minimum body of 
knowledge and the children’s right and competence in making deci- 
sions concerning what they are to learn. Teachers tending to agree 
with the item content defining this factor would appear to be flexible in 
deciding what children will learn. These same teachers seem to con- 
sider the acquisition of knowledge a shared responsibility between 
themselves and the child who is the agent of his own learning and who 


TABLE 3 
Component Loading Matrix Using an Obliquimax Transformation 


Components 
Item I II III IV V VI VII


8 36 42 
23 59 36 


9 40 47 


|. 65 
Я 58 
4l 46 
2 36 
70 
у 56 
2 52 
% 43 
5 79 
26 50 
18 43 
24 71 
21 63 
[ 57 
3 38 50 
5 65 
ie 57 


36 


Note. All entries have been multiplied by 100; only entries > .35 were included; rows were reordered for ease of interpretation.


COLETTA AND GABLE 421 


TABLE 4
Factor Names, Original LPA Item Codes, Barth Scale Item Stems and
Component Loadings for Derived Component Solution

                          Original
                          LPA     Item
Factor                    Code    Number   Stem Summary                                         Loading

I
Curricular                K       28       questionable minimum body of knowledge                 .76
Flexibility               C        7       right to make decisions                                .50
                          C        8       choice in selection of materials                       .36

II
Intellectual              K       29       knowledge resides in the knower                        .59
Development               Ev      23       observe over a long period of time                     .59
                          ID      15       similar stages of intellectual development             .53
                          ID      17       abstractions follow experiences                        .50
                          ID      14       learn at own rate and style                            .45
                          C        9       engage in high interest activities                     .40
                          Ev      20       measured qualities not important                       .35

III
Evaluating                ID      19       errors expected and desired                            .65
the Child                 K       25       qualities of being are more important                  .58
                          K       27       knowledge is personal                                  .41
                          Ev      20       measured qualities not important                       .36

IV
Learning Through          Ev      22,10*   involvement-learning takes place-best assessed by
Involvement                                direct observation                                     .70
                          I       11       collaborate in exploring                               .56
                          I       12       share something important                              .52
                          ID      15       similar stages of intellectual growth                  .44
                          Ex       2       self-perpetuating exploratory behavior                 .43
                          K       29       knowledge resides in knower                            .36

V
Learning                  C        4       confidence needed for learning and choices             .79
Facilitators              C        5       exploration in rich environment helps learning         .50
                          K       26       knowledge is personal integration of experience        .43

VI
Evaluating the            Ev      18       verification of materials                              .71
Child's Work              Ev      24       work is best measure of work                           .63
                          Ev      21       negative effect of objective measures                  .57
                          K       27       knowledge is personal                                  .46
                          Ex       1       exploration independent of adults                      .38

VII
Learning Through          Ex       3       exploratory behavior if not threatened                 .65
Exploration               Ex       6       play-predominant mode of learning                      .57
                          Ex       1       exploration independent of adults                      .50
                          C        9       engage in high interest activities                     .47
                          ID      17       abstractions follow experience                         .42
                          C        8       choice in selection of materials                       .42
                          ID      16       concrete follows abstract

* As suggested by Barth, items 22 and 10 from the original Barth Scale reported in Phi Delta Kappan, October, 1971 were combined
into one item in this study.
has the competence to make decisions pertaining to what he will learn.
Since teachers are faced with the day-to-day problems of teaching con- 
tent and of making or encouraging curricular decisions, it is 
reasonable that a dimension named Curricular Flexibility would 
emerge from the response data. Thus, high scores on Factor I would 
be obtained by teachers with positive attitudes towards curricular flex- 
ibility. 

Additional support is found for this interpretation of Factor I by 
examining the original LPA item codes in Table 4. Factor I was
defined by one item judged to reflect knowledge (Category 3) in the
categorical data study and by two items judged to reflect choice
(Category 5). The remaining factor descriptions were generated after a
similar consideration of the derived LPA item content categories. 

Factor II was called Intellectual Development as it is defined mainly 
by items judged to be descriptive of the child's intellectual development
(see original LPA codes in Table 4). Supportive of the items
depicting intellectual development are statements concerned with the 
resulting appropriate evaluation philosophies. 

Agreement with the items in Factor II would show the teacher's
understanding that children’s intellectual development is not always
determined by verbal responses and that there is a need for an
extended time period to assess the effects of the school experience on
each child. Further, teachers in accord with the items tend to believe
that children pass through similar stages of intellectual growth, along
a sequence from concrete experience to verbal abstractions, at their
own rate and in their own style. Thus, a teacher’s score on the items in
Factor II would indicate his beliefs as to how children develop
intellectually.

Factor III was defined by items concerning the teacher’s judgment 
of the child’s personal qualities rather than his work and, therefore, 
was titled Evaluating the Child. Teachers in agreement with the items 
apparently indicated the conviction that although difficult to measure, 
qualities needed in the search for knowledge (motivation,
independence, and perseverance) are more important than those
qualities which are amenable to measurement. 

Whereas the experts in the content validity study tended to sort all 
the items dealing with evaluation into one category (II, Evaluation), 
the response data dimensions suggested a more specific interpretation. 
The teachers apparently perceived evaluations as a twofold process, 
separating the child as a person from the child’s work. That is, the 
response data indicated that the items would be highly significant 
within the framework of two aspects of evaluation reflected in Factor
III and Factor VI, called Evaluating the Child's Work. Ostensibly, the
personal nature of the teacher-child relationship may explain the
teachers’ interpretation of the items.

Factor IV was called Learning Through Involvement, as it
emphasizes child involvement as a key issue in learning. Such involve- 
ment is supported by collaboration in exploring and sharing with 
others. Items contributing less to the naming of the factor state that 
children pass through similar stages of intellectual growth and that ex- 
ploratory behavior is self-perpetuating. These items partially con- 
tribute to the naming of the factor since children in similar stages are 
more likely to become involved in exploration. Such investigation en- 
courages new discoveries leading to further exploratory behavior. 
Teachers who agree with these items apparently believe that involve- 
ment with an activity stimulates learning which the child may want to 
communicate to other interested children. 

Factor V was called Learning Facilitators, as items state the need for
confidence, active exploration, and personal integration of experiences 
as facilitators in the learning process. Teachers scoring highly on these 
items tend to believe that confidence assists a child in making responsi- 
ble choices regarding his own learning. Further, they tend to believe 
that the classroom environment must contain rich and varied 
manipulative materials designed to facilitate learning. Finally, teach- 
ers scoring highly on this dimension indicate that learning is facilitated 
when knowledge is personally integrated from experience and is not 
cut into separate disciplines. Therefore, a teacher who earns a score on
Factor V would be indicating the extent to which he believes that
confidence, active exploration, and personal integration of experience
facilitate learning.

Factor VI was named Evaluating the Child's Work. The items defin- 
ing this factor emphasize the importance of the child's materials and 
actual work in evaluating his performance. Teachers in agreement 
with the items would seem to perceive materials as providing feedback
information to the child which verifies whether he has answered the
question or solved the problem. The same teachers tend to place much
emphasis on their intuitive judgment of the child’s work rather than on
objective tests. They no doubt would maintain that objective measures
create needless stress, which results in a negative effect on learning
for the young child. Moreover, those agreeing with the items tend
to believe that systematic evaluation fails to measure accurately the
personal and unique knowledge which a child possesses. Overall, a
score on Factor VI exhibits the teacher's beliefs regarding the importance
of materials and intuitive judgment rather than objective
measures as the focal points in evaluation.

Factor VII was called Learning Through Exploration. The construct 
is concerned with the exploratory nature of learning. Teachers scoring
highly on this construct probably believe that if unthreatened the child
will display exploratory behavior independent of adults. Teachers in
agreement with the items also appear to maintain that play, the most
natural kind of exploration, is not distinguished from work in early
childhood; and that when the child is given considerable choice, he is
likely to choose activities of high interest. The construct was defined
by items clearly judged to reflect the role of exploration in learning
(see original LPA code in Table 4). Thus, a respondent's score on
Factor VII indicates his convictions regarding the importance of
exploration in learning.


Factor Intercorrelations 


The intercorrelations of the primary axes were generated. The
magnitudes of the correlations (range = -.27 to +.23; mean r = +.16) did
not suggest the need for collapsing the factors into a fewer number of
dimensions.


Reliability 


The alpha internal consistency reliabilities of the derived factors 
were estimated by calculating the average of all possible combinations 
of item correlations and employing the Spearman-Brown formula 
(Stanley, 1957). The factor names, number of items per factor, and the 
resulting reliabilities were as follows: Curricular Flexibility (3 items, .64);
Intellectual Development (7 items, .74); Evaluating the Child (4 items, .73);
Learning Through Involvement (6 items, .73); Learning Facilitators (3 items,
.62); Evaluating the Child's Work (5 items, .72); Learning Through
Exploration (7 items, .76). Examination of the reliabilities indicates that
several dimensions are associated with low reliabilities. It appears that 
future research on the Barth Scale should include the creation of ad- 
ditional items for most scales to increase their levels of alpha inter- 
nal consistency reliability. 
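
The reliability estimate described above (the mean of all inter-item correlations stepped up by the Spearman-Brown formula) can be sketched as follows. The function names and the three-item correlation matrix are hypothetical and are not taken from the study; the sketch only illustrates why adding items of comparable quality would be expected to raise alpha.

```python
def mean_interitem_correlation(corr):
    """Average of the off-diagonal correlations for one scale."""
    k = len(corr)
    pairs = [corr[i][j] for i in range(k) for j in range(i + 1, k)]
    return sum(pairs) / len(pairs)


def spearman_brown(mean_r, k):
    """Reliability of a k-item scale whose items intercorrelate mean_r on average."""
    return k * mean_r / (1 + (k - 1) * mean_r)


# Hypothetical correlations among three items of one factor.
corr = [
    [1.0, 0.3, 0.4],
    [0.3, 1.0, 0.5],
    [0.4, 0.5, 1.0],
]
mean_r = mean_interitem_correlation(corr)
alpha_3_items = spearman_brown(mean_r, 3)  # about .67
alpha_6_items = spearman_brown(mean_r, 6)  # about .80 if the mean correlation held
```

Under the assumption that new items correlate with the existing ones at about the same level, doubling a three-item scale moves the estimate from roughly .67 to .80 in this example.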


Summary and Conclusions 


The judges tended to sort the Barth Scale items into general and easily
identifiable content categories. In some cases, the response data, as
compared with the judgmental data, reflected a more specific
interpretation of factors. An example can be seen in judgmental
Category 2 (Evaluation). The content experts sorted all the
items pertaining to evaluation into that category. In contrast, the
response data suggested two aspects of evaluation: the Child (Factor
III) and the Child's Work (Factor VI). It was suggested that the
personal nature of the teacher-child relationship contributed to the
teachers’ interpretation of the items. 

Thus, the results of the content validity study based on judgmental 
data gathered from content experts contributed to naming the dimen- 
sions derived from analyzing actual classroom teacher response data. 
Although facilitating the understanding of the constructs under study, 
the differences between the judgmental categories and response data 
dimensions do lead one to consider the possible disparity between 
theory and practice in open education. If disagreement between
theorists and practitioners does exist, what are the resulting implications
for open classroom implementation? It appears that future research in
open education might take this question into consideration. 
DO THESE CO-TWINS REALLY LIVE TOGETHER?
AN ASSESSMENT OF THE VALIDITY OF THE HOME 
INDEX AS A MEASURE OF FAMILY 
SOCIO-ECONOMIC STATUS’ 


LOUISE CARTER-SALTZMAN? AND SANDRA SCARR-SALAPATEK 
University of Minnesota 


WILLIAM B. BARKER
University of Pennsylvania 


In a study of 400 pairs of same-sex twins, ages 10-16, in the 
Philadelphia area, the Home Index was used as a measure of SES. 
Because there were numerous disagreements between co-twins, 
analyses were done of each item. The Home Index was then rescored 
using only the 10 items that reached a criterion level of 75% twin 
agreement. Two scoring methods were used on the 10-item Home 
Index: one treating "don't know" as equivalent to a blank (yes = 2, 
no = 1, don’t know = 0) and the other being a variation on that scor- 
ing method by giving additional weight to “don’t know" responses 
(yes = 2, no = 0, don't know = 1). By correlating the three scorings 
(the original scoring of 24 items and the two scorings of 10 items) of 
the Home Index with each other, with census tract data, and with five 
cognitive measures used in the study (Raven Standard Progressive 
Matrices, Peabody Picture Vocabulary Test, Columbia Mental 
Maturity Scale, Benton Revised Visual Retention Test, and a Paired- 
Associate Test), it was determined that the original Home Index was 
a more valid measure for white subjects than for black subjects. It 
could not, however, be recommended as a highly valid measure of
SES in either group.


IN a study of 400 pairs of same-sex twins in the Philadelphia area, an 
individual measure of socio-economic status, the Home Index
(Gough, 1954, 1970), was administered to all of the children, ages 10 to
15 years. The Home Index is intended for use with older elementary,
junior high, and high school students to assess family status and life 
style characteristics (Gough, 1949; 1971a; 1971b). It is reported to cor- 
relate highly with other measures of socio-economic status (SES) such 
as other status questionnaires and parental occupational levels 
(Gough, 1949, 1971a) and to predict college attendance (Gough, 
1971b). 

The present study assessed the validity of reports on family social 
status by comparing responses of two members from the same family, 
in this case co-twins. Co-twin agreement on factual items about their
family is a particularly good criterion for evaluating the validity of an 
SES measure. If two children in the same family, especially siblings of 
the same sex and same age, do not agree on the information, then one 
must question the usefulness of information obtained in the scale. 


Methods 
Subjects 


Twin pairs were recruited by letter from a complete list of twins in 
the Philadelphia public schools and from parochial and suburban 
schools by newspaper articles and a television news broadcast. The 
final sample of twins tested included 399 pairs and two sets of triplets. 
Of the 175 black twin pairs, 157 attended Philadelphia public schools, 
18 other Philadelphia schools, and none suburban schools. Of 224 
white pairs, 89 attended Philadelphia public schools, 42 other 
Philadelphia schools, and 93 suburban schools. 

The twins came from the greater metropolitan area to form a 
representative sample of families in the Philadelphia area. Median in- 
come and educational levels were computed for the census tracts from 
which the twins were drawn. The median values of family income in 
the samples of blacks and whites are very close to the median figures 
reported for the Philadelphia metropolitan area. For the twin sample, 
the whites’ median income in 1970 was $11,000, median education
was 11.9 years; blacks’ median income was $7,910, median education
was 10.2 years.


Procedure 


About 10 pairs of twins were tested at the same time; they were 
seated in alternate chairs and rows in a large auditorium. Co-twins 
were separated into different small groups, each with an adult leader 
who answered questions and guided them through the afternoon's
assessments. There was no opportunity for co-twins to collaborate on
the tests. 

The Home Index and other psychological measures were presented 
on 35 mm. slides, synchronized with audio tapes of instructions and 
items read aloud. No reading skills were required. Answer sheets were 
specially formatted for each test with only the appropriate number of 
items and answer alternatives. Items and alternative responses were 
numbered and lettered in black, primary-school type for increased 
clarity. Instructions on the use of the answer sheets were given prior to 
the Home Index, and children's accuracy in use of the sheets was noted 
by the group leaders for the first two Home Index items. This method 
of administration did not decrease the reliabilities on any of the 
cognitive measures. 


Measures 


Home Index. The 1954 version of the Home Index consists of 24 
items, divided into four subscales derived by factor analysis: social 
status (8 items), ownership (10 items), socio-civic involvement (4 items), 
and aesthetic (2 items). Possible responses to the item questions were 
“yes,” scored as 2 points, “no,” scored as 1, and “don’t know,” scored
as 0. Blanks were also given a 0 score. This scoring deviates from
Gough's method, which gave “yes” responses a score of 1 and “no”
responses a score of 0. The authors included a “don’t know” response
to increase the validity of the data obtained.

Although in Gough's 1970 version of the Home Index the items 
"radio" and “television” were deleted, in the current form "radio" 
was retained, and “television” changed to “color television" in the 
hope that the discriminability of the item would be increased. 

Cognitive measures. Five cognitive tests were administered to the 
sample as part of two sessions, each lasting about 1½ hours: the
(Raven) Standard Progressive Matrices, sets A, B, C, and D (Raven, 
1958); the Peabody Picture Vocabulary Test (Dunn, 1959); the Colum- 
bia Mental Maturity Scale (Burgemeister, Blum, and Lorge, 1959); the 
Revised Visual Retention Test (Benton, 1963), and a paired-associate 
learning test (Stevenson, Hale, Klein, and Miller, 1968). 

Census tracts. Median education and median income levels for every 
census tract in which a twin family lived were ascertained from the 
1970 U.S. census. The distribution of income and educational levels 
was divided around the median to form higher and lower SES sub- 
groups in each racial group. Mixed cases of above and below median 
values for income and education were assigned to the lower SES subgroups.
Because the overlap in census tract medians was so small, it
was impossible to use a common median for the two races. Thus, 
higher and lower SES subgroups were not comparable across race. 

Other measures. Many other measures of dental and physical 
growth, personality and self-esteem, blood groups, taste preferences, 
and dermatoglyphics were obtained but are not reported in this paper. 


Statistical Analyses 


To obtain measures of co-twin agreement, intraclass correlations 
(McNemar, 1962, p. 284) were computed. Interclass correlations of 
Home Index scores with cognitive measures and with census tract data 
were calculated for a sample comprised of one twin from each family. 
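
A minimal sketch of an intraclass correlation for twin pairs follows. The text cites McNemar (1962) without reproducing the formula, so the one-way analysis-of-variance form for groups of size two is assumed here; the function name and the example scores are hypothetical.

```python
def intraclass_correlation(pairs):
    """One-way ANOVA intraclass correlation for pairs of scores.

    pairs: list of (twin_a_score, twin_b_score). Assumes at least two
    pairs and some variation in the scores.
    """
    n = len(pairs)
    grand_mean = sum(a + b for a, b in pairs) / (2 * n)
    # Between-pair sum of squares: two scores per pair, so weight by 2.
    ss_between = sum(2 * ((a + b) / 2 - grand_mean) ** 2 for a, b in pairs)
    # Within-pair sum of squares: (a - m)^2 + (b - m)^2 equals (a - b)^2 / 2.
    ss_within = sum((a - b) ** 2 / 2 for a, b in pairs)
    ms_between = ss_between / (n - 1)
    ms_within = ss_within / n
    return (ms_between - ms_within) / (ms_between + ms_within)


# Hypothetical Home Index totals for co-twins.
identical_reports = [(30, 30), (36, 36), (41, 41)]
opposed_reports = [(1, 2), (2, 1)]
icc_identical = intraclass_correlation(identical_reports)
icc_opposed = intraclass_correlation(opposed_reports)
```

Unlike an ordinary (interclass) correlation, this coefficient does not depend on which twin is arbitrarily labeled first, which is why it suits co-twin agreement.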

Home Index records with six or more consecutive missing answers 
or ten or more total missing answers were deleted from the study. 
Fourteen of the 804 records were lost in this way. Only 28 of the 
remaining subjects had one or two missing answers, which were scored 
as blanks. A few additional subjects were lost to subsequent analyses 
because of incomplete data on other measures. 
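
The exclusion rule for incomplete records (six or more consecutive missing answers, or ten or more missing answers in total) can be expressed as a short check. This is an illustrative sketch only: the function name, the use of `None` for a missing answer, and the example records are assumptions.

```python
def should_exclude(responses, max_run=6, max_total=10):
    """Return True if a record meets either missing-answer criterion.

    responses: list of item responses, with None marking a missing answer
    (a hypothetical encoding).
    """
    total_missing = 0
    current_run = 0
    longest_run = 0
    for response in responses:
        if response is None:
            total_missing += 1
            current_run += 1
            longest_run = max(longest_run, current_run)
        else:
            current_run = 0
    return longest_run >= max_run or total_missing >= max_total


# Hypothetical 24-item records.
six_in_a_row = ["yes"] * 5 + [None] * 6 + ["no"] * 13
five_in_a_row = ["yes"] * 5 + [None] * 5 + ["no"] * 14
ten_scattered = ["yes", None] * 10 + ["yes"] * 4
nine_scattered = ["yes", None] * 9 + ["yes"] * 6
```

A “don’t know” response is an answer and would not count as missing under this rule.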


Results 


Twin Agreement 


Correlations of the total Home Index score of one twin with the co- 
twin revealed substantial disagreement on family information. The 
overall correlation for all twin pairs was only .72. Further analysis of 
subgroups indicated that disagreements were far more common for 
black pairs (r = .57) than for white pairs (r = .77), and that higher SES
pairs in both races had somewhat less agreement than did lower SES 
pairs. 

An analysis of agreement for each item was done to explore the 
sources of co-twin variance. Of the 24 items in the Home Index, only
10 were found to achieve 75% agreement in both racial groups, and 
only three items to achieve 90% agreement. The distribution of agree- 
ment (percentage of pairs where both twins gave the same response: 
"yes," "no," or *don't know") by subtest is found in Table 1. Male 
pairs and female pairs were considered separately. 
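
The item-level agreement index and the 75% criterion can be sketched as follows. The function names, the response labels, and the example data are hypothetical; the sketch simply counts pairs in which both twins gave the same response to an item and then keeps the items that meet the criterion in every group.

```python
def percent_agreement(pairs):
    """Percentage of pairs giving identical responses, item by item.

    pairs: list of (twin_a, twin_b), each a list of item responses.
    """
    n_items = len(pairs[0][0])
    return [
        100.0 * sum(1 for a, b in pairs if a[i] == b[i]) / len(pairs)
        for i in range(n_items)
    ]


def items_meeting_criterion(agreement_by_group, criterion=75.0):
    """Indices of items whose agreement reaches the criterion in every group."""
    n_items = len(next(iter(agreement_by_group.values())))
    return [
        i
        for i in range(n_items)
        if all(group[i] >= criterion for group in agreement_by_group.values())
    ]


# Hypothetical responses of four twin pairs to two items.
pairs = [
    (["yes", "no"], ["yes", "no"]),
    (["yes", "dont_know"], ["yes", "no"]),
    (["no", "yes"], ["no", "yes"]),
    (["yes", "yes"], ["yes", "no"]),
]
agreement = percent_agreement(pairs)
by_group = {"group_a": agreement, "group_b": [80.0, 90.0]}
kept = items_meeting_criterion(by_group)
```

Here the second item is dropped because it falls below the criterion in one group, even though it passes in the other.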

Overall agreement was higher for white than for black pairs on all 
items. There were no consistent sex differences in agreement for the 
white pairs, but black male pairs were found to agree more often than 
were black female pairs for 18 of the 24 items, with two ties. 

Using Gough's 22-item version (1970) in the analysis was considered,
but since “radio” and “color television” were the third and
fourth most reliable items, respectively, they were retained for all
analyses.

TABLE 1
Percentage of Co-Twin Agreement on Home Index Items

                                                              Black Pairs    White Pairs
                                                              Male Female    Male Female
Item                                                          (73)   (96)   (120)  (104)

Subtest 1: Social Status
  6. Did your mother go to high school?                         15     67      87     86
  7. Did your mother go to a college or university?             56     61      68     73
  8. Did your father go to high school?                         60     59      77     82
  9. Did your father go to a college or university?             64     53      74     73
*    Do you have a fireplace in your home?                      86     79      91     93
*    Does your family have any servants, such as a
     cook or maid?                                              81     79      90     89
     Does your family leave town every year for a
     vacation?                                                  60     59       6     84
 23. Does your family have more than 500 books?                 52     42      62     62

Subtest 2: Ownership
* 1. Is there an electric or gas refrigerator in your home?     93     84      96     97
*    Is there a telephone in your home?                         99     94      99     96
*    Do you have a bathtub in your home?                        95     93      98     98
  4. Is your home heated with a central system, such
     as by a furnace in the basement?                           59     59      73     78
*    Does your family have a car?                               89     85      98     98
*    Does your family have a radio?                             97     92      97     98
*    Does your family have a phonograph (record
     player)?                                                   85     81      87     86
     Do you have your own room at home?                         75     68      76     88
     Does your family own its own home?                         68     65      76     75
*    Does your family have a color television at
     home?¹                                                     89     90      98     97

Subtest 3: Socio-Civic Involvement
 17. Does your mother belong to any clubs or
     organizations, such as ...                                 67     66      73      М
 18. Does your father belong to any civic, study,
     soc., or polit. clubs, ...                                 64     70      68     70
 21. Does your family subscribe to a daily
     newspaper?                                                 62     60      81     9%
 22. Do you belong to any club where you have to
     pay dues?                                                  68     74      72     15

Subtest 4: Aesthetic Involvement
*    Do you have a piano in your home?                          89     88      98     96
     Have you ever had private lessons in music,
     dancing, art, etc., outside of school?                     66     68      15     73

* Items that achieved 75% or more agreement.
¹ This item was changed from television to color television because the original item had no correlation with social status.

Since it is possible that twin agreement might increase as a function
of age, item agreements were correlated with age for all four race and
sex groups. The correlations were highest for white males (r = .45, p <
.001) and lowest for black males (r = .11, p > .05).


Rescoring the Home Index 


Since only 10 items achieved 75% agreement between co-twins in
both racial groups, these items were selected for rescoring. Two
methods were used to rescore the 10-item Home Index: one approach,
identical to that used to score the 24-item Home Index, treated
“don’t know” as equivalent to a blank (yes = 2, no = 1, and don’t
know or blank = 0), and a revised method gave intermediate
value to the “don’t know” response (yes = 2, don’t know = 1, no and
blank = 0). The second method was based on the information value of
unsure responses, as “don’t know” indicates uncertainty about
whether a “yes” or a “no” response would be more nearly accurate.
The 10-item Home Index scored by the first method was designated as
Home Index 10₁; the 10-item Home Index scored by the revised approach
as Home Index 10₂; whereas the originally scored Home Index
of 24 items was referred to as Home Index 24.
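
The two scoring keys can be compared with a small sketch. The point values come from the text; the dictionary names, the response labels, and the example answers are hypothetical.

```python
# First method: "don't know" treated like a blank.
ORIGINAL_KEY = {"yes": 2, "no": 1, "dont_know": 0, None: 0}
# Revised method: "don't know" given an intermediate value.
REVISED_KEY = {"yes": 2, "dont_know": 1, "no": 0, None: 0}


def score(responses, key):
    """Sum the item points for one child's responses (None marks a blank)."""
    return sum(key[response] for response in responses)


# Hypothetical answers from two co-twins to the same five items.
twin_a = ["yes", "no", "no", "dont_know", None]
twin_b = ["yes", "no", "no", "yes", None]

gap_original = abs(score(twin_a, ORIGINAL_KEY) - score(twin_b, ORIGINAL_KEY))
gap_revised = abs(score(twin_a, REVISED_KEY) - score(twin_b, REVISED_KEY))
```

When one twin answers “yes” and the other “don’t know,” the first key puts them two points apart and the revised key only one, which is consistent with the idea that an unsure response carries partial information.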

To estimate the usefulness of the rescoring procedures, it was necessary
to examine the co-twin agreement on the revised scales and to
choose criterion variables for validation. Although census tracts are
not good measures of individual SES characteristics, such measures, in
reflecting general neighborhood factors, should correlate positively
with individual measures. In addition to census tract data, the five
cognitive measures were chosen as criteria because they had often been
found to be positively correlated (about .3 to .4) with SES
characteristics within both black and white groups. Table 2 gives the
intercorrelations of the three Home Indices and their correlations with
the criterion measures.


TABLE 2
Correlations of the Three Home Indices and Criterion Variables for Black and White Twins

Whites (N = 223)
Blacks (N = 172)

Measures
1. Home Index 24
2. Home Index 10₁
3. Home Index 10₂
4. Raven
5. Columbia
6. Peabody
7. Benton (errors)
8. Paired-Associate
9. Census Tract Educ.
10. Census Tract Income

The three scorings for the Home Index (the original scoring of 24 
items and the two different scorings of ten items) gave results that were 
positively intercorrelated, as expected. The 10₁ and 10₂ versions were
highly related, especially for the white pairs. The Home Index 
measures did not correlate so highly with the cognitive tests as did the 
census tract median data in the black group. In the white group, the 
three Home Indices did correlate slightly more highly with the 
cognitive tests than did the census tract data. Although the three 
Home Indices were positively correlated with census tract data in 
both racial groups, the coefficients were not very high. 

Co-twin agreement by race and SES (census tracts) for the Home Indices 
is given in Table 3. The 10-item Home Index scored by the revised 
approach is clearly superior in degree of co-twin agreement to the 
other two versions of the scale in both black and white groups and for 
the lower and higher SES pairs within both races. Agreement between 
co-twins was lowest on the 10-item version scored by the first method 
and highest on the 10-item version scored by the revised approach. This 
result indicated that "don't know" was an intermediate response, given 
when one twin had information (yes or no) and the other did not. 

One possible explanation of the low correlations between the 
rescored Home Indices and the criterion measures could be the 
reduced variance of scales with ranges of 0-20 instead of the original 
0-48 range of the 24-item Home Index. However, it was found that 
Home Index 10, did have a higher variance than did 10, but correlated 
no more highly with the criterion measures. It is, therefore, doubtful 
that reduced variance could account for the correlational results. 

It is also possible that the 10 valid items did not discriminate to a 
high degree between groups because of the marked unevenness of fre- 
quency distributions of responses over item alternatives. For example, 
nearly everyone has a bathtub, refrigerator, and radio, but few have 
servants or a piano. There are several items, however, for which the 
frequencies of possession and nonpossession are relatively equal (color 
TV, fireplace, telephone, car, and phonograph). 


Race and Age Differences 

Overall, the white co-twins agreed more often than black co-twins 
on family information. With a 90% agreement criterion, eight items 
qualified as valid for the white group, whereas only three items were 
valid for blacks (see Table 1). A 75% criterion selected 15 of the 24 
items for both sexes of white twins and only 10 items for the blacks. 
White twins did agree more often with each other than did black twins 

TABLE 3 
Twin Correlations for Three Versions of the Home Index by Race and Race and SES 
(Census Tracts) 


                  Black                 White               Total 
                  Lower  Higher  All    Lower  Higher  All 
                  (94)   (78)    (172)  (124)  (97)    (221) 
Home Index 24     199 44 :57 79 .65 177 12 
Home Index 10,    dE E35 28 .60 79 571 49 
Home Index 10,    48 70 .60 79 88 85 74 


on parental education, home ownership, newspaper subscription, and 
having their own bedrooms. 
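
The agreement criterion described above (an item counts as valid when co-twins give the same answer in at least 90%, or at least 75%, of pairs) can be illustrated with a short sketch. The item names, the answers, and the function names below are hypothetical and serve only to show the arithmetic; they are not the study's data or the authors' procedure.

```python
def agreement_rates(pairs):
    """Return, per item, the fraction of twin pairs whose two answers match.

    `pairs` is a list of (twin_a, twin_b) tuples; each twin is a dict
    mapping an item name to that twin's answer.
    """
    items = pairs[0][0].keys()
    rates = {}
    for item in items:
        matches = sum(1 for a, b in pairs if a[item] == b[item])
        rates[item] = matches / len(pairs)
    return rates


def valid_items(pairs, criterion):
    """Items whose co-twin agreement rate meets or exceeds `criterion`."""
    rates = agreement_rates(pairs)
    return sorted(item for item, rate in rates.items() if rate >= criterion)


# Hypothetical answers for four twin pairs on three items (not study data).
pairs = [
    ({"telephone": "yes", "piano": "no", "father_college": "yes"},
     {"telephone": "yes", "piano": "no", "father_college": "dk"}),
    ({"telephone": "yes", "piano": "no", "father_college": "no"},
     {"telephone": "yes", "piano": "yes", "father_college": "no"}),
    ({"telephone": "no", "piano": "no", "father_college": "no"},
     {"telephone": "no", "piano": "no", "father_college": "yes"}),
    ({"telephone": "yes", "piano": "no", "father_college": "yes"},
     {"telephone": "yes", "piano": "no", "father_college": "yes"}),
]

print(valid_items(pairs, 0.90))  # items passing the strict criterion
print(valid_items(pairs, 0.75))  # items passing the lenient criterion
```

Lowering the criterion can only add items to the valid set, which is why the 75% criterion admits more items than the 90% criterion in both groups.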

Older twin pairs tended to agree more often than younger pairs, es- 
pecially in the white group. In the black group, however, disagree- 
ments were far more frequent and less related to age than in the white 
group. In the white group the validity of information received from the 
Home Index clearly increased with age—a finding which suggests that 
a restriction of use of the Home Index to white high school students 
would increase its validity. In the black group, for whom increased age 
did not improve the validity of information given, the Home Index 
was probably not a useful instrument. 


Toward SES Measurement 


The criterion measures of median census tract education and in- 
come levels and the five cognitive scores correlated more highly with 
the rescored Home Indices in the white than in the black group. 
Gough (1971a) reported a correlation of .21 between the original 
Home Index and intellectual ability in a high school population. The 
correlations between the cognitive measures and (a) the original set of 
scores and (b) each of the two sets of scores for the 10-item Home 
Index were of similar magnitude in the white twin sample. 

The size of mean differences between the white and black groups on 
the Home Index measures was small, as compared to that of the 
differences in the census tract data. On the Home Index, 18.5% of the 
black children equalled or exceeded the median for the white children. 
The extent of census tract overlap was considerably less: for education 
about 8% of the blacks were equal to or exceeded the white median, 
and for income about 2% of the blacks were at or above the median of 
the whites. The Home Index appeared to minimize estimates of SES 
differences between the two racial groups and within each racial 
group. 
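
The overlap figures quoted above are percentages of one group scoring at or above the other group's median. A minimal sketch of that computation follows; the scores and the function name are hypothetical and are not taken from the study.

```python
from statistics import median


def percent_at_or_above_median(group, reference):
    """Percentage of `group` scores that equal or exceed the median of `reference`."""
    cutoff = median(reference)
    count = sum(1 for score in group if score >= cutoff)
    return 100.0 * count / len(group)


# Hypothetical scale scores (not study data).
reference_scores = [10, 12, 14, 16, 18]
group_scores = [6, 8, 9, 11, 13, 14, 15, 7, 10, 12]

print(percent_at_or_above_median(group_scores, reference_scores))
```

Two identical distributions would give a value near 50%, so smaller values indicate less overlap. On this reading, the 18.5% overlap on the Home Index against roughly 8% and 2% for the census tract measures is what suggests that the Home Index minimizes group differences.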

The lack of co-twin agreement on many of the Home Index items 
raises serious questions about its usefulness as a measure of individual 
SES characteristics for a 10- to 16-year-old group. That the Home 
Index failed to correlate with intellectual measures more highly than 
with census tract data and that it failed to discriminate clearly between 
disadvantaged and advantaged populations would suggest that its 
validity is in doubt. Since ownership items proved to be the most 
reliable, it is recommended that additional items, preferably items that 
would afford alternative responses at intervals within the middle range 
of the SES distribution, be added to a revised scale, and that data be 
collected from two informants in each family to substantiate the 
validity of a new scale with 10- to 16-year-old subjects. Parental reports 
of SES characteristics would be a particularly valuable comparison 
for the reports of their children. 


STABILITY OF STUDENT EVALUATIONS OF INSTRUCTORS 
AND THEIR COURSES WITH IMPLICATIONS FOR 
VALIDITY 


HENRY J. OLES 
Southwest Texas State University 


A course-instructor evaluation form was administered to 775 un- 
dergraduates in 15 large and small section introductory courses after 
the second class meeting and again near the end of the semester. The 
median pretest-posttest correlation was +.60. Students were generally 
more negative toward their course and instructor at the end of 
the semester than they were at the beginning. 

As a separate portion of this project, two instructors deliberately 
attempted to alter their students' evaluations in one of two large sections 
of their introductory psychology course. In both cases, there 
was a significant overall mean difference between the experimental 
and control groups on the initial evaluation but there was no 
difference on the end-of-semester evaluation. 

The results of this study indicated that although students quickly 
form reasonably lasting judgments of their instructors and courses 
they are also able to alter their judgments as warranted by changing 
situations. These findings appear to provide support for the validity 
of student evaluations. 


THE use of student evaluations of faculty and courses is now common 
on most college campuses, with the evaluative information being 
used both by students and administrators for decision making. 
Although many arguments have been made against using student 
evaluations as a primary criterion for professional advancement 
(Dressel, 1973), most of the research findings have indicated that student 
evaluations are reliable and reasonably valid indicants of teacher 
performance. Regardless of the opinion of academia, student evaluations 
are being used at most institutions of higher education for a wide 
variety of purposes. Therefore, it is of prime importance to continue to 
conduct research on student evaluations to determine and improve 
their reliability, validity, and utility. 

Several recent studies have attempted to determine the relationship 
between ratings made while the course was in progress and those 
made at the end of the course (Dick, 1967; Costin, 1968). Bausell and 
Magoon (1972) reported a median correlation of .67 between ratings 
made at the end of the first class period and those made at the end of 
the semester. This finding was of particular concern to this researcher, 
since it could indicate that students enter a course with a definite predisposed 
and unalterable set of feelings about the course and instructor 
or that they quickly form a rigid and lasting set of attitudes after 
only minimal exposure. 


Purpose 


This study was designed to replicate and expand upon the work of 
Bausell and Magoon concerning stability of ratings of selected teacher 
and course characteristics from the beginning to the end of a semester. 
Within the college and university setting such information was 
thought to be of importance in judging the potential validity of an 
evaluation of a course and teaching performance at two widely separated 
time points. Bausell and Magoon used upper level undergraduate 
and graduate classes with a median size of 15. In addition they 
emphasized a standard student evaluation form for collecting both beginning 
and end-of-course evaluations. In a pilot study this researcher 
found that undergraduate students vehemently objected to 
evaluating a teacher or course after the first or second class day when 
they were required to complete a form that obviously had been de- 
signed for use at the end of the semester. Therefore, the form used in 
this study was designed to overcome student objections through care- 
ful wording of the directions to the respondent on the pretest as well as 
through emphasis of the fact that the form was specifically designed to 
measure their first impression of the instructor and his course. In addition, 
each of the questions was worded to make it appropriate for a 
first impression evaluation. The posttest was essentially the same as 
the pretest with only minor changes in tense (i.e., whereas the pretest 
stated “This teacher seems to be . . . ,” the posttest stated “This teacher 
was . . .”). 

Virtually all research on student evaluations has been conducted 
after the fact. This researcher and a colleague each taught two essentially 
identical sections of introductory psychology with approximately 
125 students in each section. A deliberate attempt was made to 
create a negative first impression in the experimental section by beginning 
the course with an unusually dry lecture on the historical roots of 
the science of psychology and on the methods of science to determine 
whether this treatment would alter the student ratings as compared 
with those of a class receiving a high interest-arousing presentation. If 
a variation in instructor performance was reflected in student ratings 
in the direction intuitively expected, the result would add to the construct 
validity of student ratings in general, since many skeptics have 
insisted that student ratings are not directly related to any meaningful 
teacher behavior. 


Methodology 


The subjects for this study included 1302 undergraduate students 
enrolled in 15 lower division courses taught by 13 different instructors 
with class sizes ranging from 17 to 154. 

The scale consisted of 22 evaluative items (21 on the pretest) covering 
various dimensions of instructor performance and of the course. 
Each item consisted of a statement describing the instructor or the 
course followed by four or five evaluative phrases ranging from very 
positive to very negative. The overall rating was composed of a simple 
summation of the individual ratings. Lower numerical ratings in- 


with the posttest which was administered by the same person during 


test when they took the pretest. An identification system was used to 
enable the matching of pre and post evaluations that still permitted 
respondents to remain anonymous. 

Two of the 13 instructors involved in the project each taught two essentially 
identical sections of introductory psychology. Their normal 
approach to beginning the introductory course was quite different. 
One instructor (instructor A) used several interest arousing lectures 
while the other (instructor B) plunged in the first day with an admittedly 
dry, at least in terms of student interest, lecture on the 
methods of science and historical perspectives in psychology. Each 
instructor agreed to attempt to alter his behavior in one class to match 
that of his colleague. This procedure resulted in two classes that received 
a high interest introductory lecture and two classes that received 
a rather low interest lecture. The instructors were then told to 
continue the semester after the second day with their standard style of 
teaching. Ideally, this portion of the study should have been extended 
to a significantly larger number of instructors and courses. However, 
this researcher was concerned about the moral and ethical obligations 
of every teacher to do his best in teaching his courses. Therefore, the 
decision was made to use deliberate modification of normal teaching 
practice in only two highly controlled situations even though this decision 
would result in some questioning of the validity and generalizability 
of the findings. 

Two threats to the internal validity of the procedures employed in 
this investigation were (a) the possible effects of having taken the pretest 
on posttest performance and (b) the students' familiarity with the 
instructor before the first class meeting. Bausell and Magoon specifically 
examined their data for pretest sensitization and found none. No 
similar test was performed in this study; however, observation of student 
reactions to the posttest indicated that they had virtually forgotten 
having taken the pretest three months previously. None of the 
students had had any previous classroom exposure to the instructor, 
since all the courses were introductory. 


Results and Interpretation 


Pretest-Posttest Comparisons 


Of the 1320 students who took the pretest, 775 were matched with 
their posttest ratings. Approximately 40% of the subjects were lost because 
of absences, withdrawals, incomplete forms, and inability to 
match the two forms. 

Table 1 presents the percentage of subjects selecting each response 
option for 21 items on the pretest and posttest and for one item found 
only on the posttest. The most interesting finding is the large propor- 
tion of students who chose the most favorable response options, 0 and 
1. Response option 2 was rarely selected for most items while option 3 
responses were essentially nonexistent, especially on the pretest. Students 
were evidently inclined to give positive ratings even to relatively 
poor teachers. 

The mean rating for each of the evaluative items was calculated for 
each class on the pretest and posttest. Pretest-posttest mean ratings 
were significantly different at the .05 level for nine items. As compared 
with their average pretest standing, the average posttest standings of 

TABLE 1 
Percentage of Students Selecting Each Response Option 

                                    Response Option Pretest   Response Option Posttest 
Item                                0 1 2 3 4                 0 1 2 3 4 

 1. Interest in Course              40 48 9 2 0               28 44 18 7 2 
 2. Course Difficulty               4 39 47 9 0               8 43 40 9 1 
 3. My Grade                        22 62 15 0 0              9 40 42 9 1 
 4. Textbook                        17 47 32 5 0              17 36 32 12 4 
 5. Course Organization             39 57 3 0                 40 53 6 1 
 6. Teacher's Knowledge             79 21 0 0                 TBS ak 1 0 
 7. Teacher's Attitude Toward 
    Course                          67 30 2 1                 64 32 3 1 
 8. Teacher's Explanations          58 37 5 0                 50 39 9 1 
 9. Intellectual Stimulation        24 66 10 0                19 60 20 1 
10. Speaking Ability                68 30 2 0                 63 33 3 1 
11. Teacher's Attitude Toward 
    Students                        832720 7516 1 $6: (1320811 1 
12. Grading Fairness                1827249 3 0               53 42 4 1 
13. Tolerance to Disagreement       58 39 2 0                 54 41 3 2 
14. Teacher's Personality           59 36 3 2                 57 38 3 2 
15. Overall Rating                  20 47 32 2                24 47 25 4 
16. Desire to Attend Class          58 40 1 1                 27,4 1 2 
17. Value of Attendance             96 4 1 ы 15 4 
18. Utilization of Time             80 19 1 0                 70 25 4 1 
19. Amount Learned                  64 34 2 41 4 8 
20. Satisfaction With Course        70 26 4 775217 6 
21. Sticks to Subject               66 32 2 652,792 4 
22. Recommend to Friends 
    (posttest only)                 СЕО 


the students indicated significantly less interest in their course, expectation 
of a lower grade, greater objection to the textbook, 
finding explanations by the teacher more inadequate, less desire to attend 
class, seeing less value in attending class, and wasting of more 
class time by the instructor at the end of the semester than at the beginning. 
However, students did see examinations and grading as being 
more nearly fair at the end of the course than at the beginning even 
though many had expected to receive considerably lower grades than 
they had initially anticipated. 

The median pre-posttest correlation for all 21 items was .60; individual 
item correlations ranged from -.11 for the amount of information learned 
to +.86 for course difficulty and attractiveness of the teacher's personality. 
Generally, the obtained correlations could be rationalized. Those aspects 
of the course that could potentially be reliably and validly assessed at 
the beginning were highly correlated with posttest ratings, whereas 
those aspects that could conceivably be accurately rated only after sev- 
eral weeks of exposure showed low correlations. 
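
The stability statistic reported above can be sketched as follows: for each item, the mean pretest rating of each class is correlated with the mean posttest rating of the same class, and the median of the resulting item correlations is taken. The class means, item names, and function names below are hypothetical; the sketch assumes the class is the unit of analysis, as the summary of this study indicates.

```python
from math import sqrt
from statistics import mean, median


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def item_stability(pre, post):
    """Return per-item pre/post correlations and their median.

    `pre` and `post` map an item name to a list of class means,
    with classes listed in the same order in both.
    """
    rs = {item: pearson_r(pre[item], post[item]) for item in pre}
    return rs, median(rs.values())


# Hypothetical class means for three items in four classes (not study data).
pre = {
    "course_difficulty": [1, 2, 3, 4],
    "amount_learned": [1, 2, 3, 4],
    "speaking_ability": [1, 2, 3, 4],
}
post = {
    "course_difficulty": [2, 4, 6, 8],   # same ordering of classes
    "amount_learned": [4, 3, 2, 1],      # ordering reversed
    "speaking_ability": [1, 3, 2, 4],    # ordering partly preserved
}

rs, med = item_stability(pre, post)
print(rs)
print(med)
```

Because the median is used, a single item with a very low or negative correlation, such as the amount learned item here, has little effect on the summary value.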


Table 3 presents the results of a deliberate attempt by two instructors 
to alter their initial expected student ratings in two of the four essentially 
equivalent sections of introductory psychology. Both instructors 
used a highly interesting introductory lecture in one section and a 
rather dry monotone lecture in the other section. The difference between 
the mean pretest ratings was significant at the .02 level for 
Instructor A and beyond the .01 level for Instructor B. That there were no 
statistically significant differences on the posttest ratings for either 
instructor indicated that the students were indeed able to alter their 
first impression ratings to fit the instructor's typical performance 
shown throughout the semester. Although the differences in mean ratings 
between the interest and noninterest arousing introductory lectures 
were highly significant, the generalizability of this finding is low 
because of the small N (2 instructors, 4 sections). However, this researcher 
believes that these findings are of critical importance in 
demonstrating the validity of student evaluations. This portion of the 
study demands replication on a larger scale, if adequate control can be 
maintained to protect those students who may be inadvertently negatively 
affected by unknowingly being part of the experimental group. 


TABLE 3 
Comparison of Pre- and Posttest Ratings When Instructors Deliberately Altered 
Their Normal Teaching Style 


INSTRUCTOR A 
Pretest              Pretest         Posttest             Posttest 
(interest arousing)  (no interest)   (interest arousing)  (no interest) 

M = .48   M = .63   t = 2.83*   M = .58   M = .59 
SD = .36  SD = .33  r = .80     SD = .36  SD = .39 


INSTRUCTOR B 
Pretest              Pretest         Posttest             Posttest 
(interest arousing)  (no interest)   (interest arousing)  (no interest) 

M = .95   t = 4.97**   M = 1.16   M = 1.20 
505 be   SD = .42   r= 2.73   SD = .52   SD = .48 


* Significant at .02 level. 
** Significant at .01 level. 


Summary and Conclusions 


This investigation examined three aspects of student rating of college 
instructors: (a) the distribution of ratings given after the first or 
second class meeting and again during the last week of the semester; 
(b) the correlation between mean pretest and posttest ratings for each 
item using fifteen classes; and (c) the effects of short-term deliberate 
manipulation of teaching style on ratings. 

The data in Tables 1 and 2 reveal that students in this study had a 
definite tendency to rate instructors positively on both the pretest and 
posttest, although ratings on the posttest were generally more negative 
(and variable) than those on the pretest, as shown by the higher 
rating values assigned. Only one item, expected course difficulty, had 
ratings above the expected mean (1.5) on the pretest. The overall means 
for the combined 21 items on the pretest and posttest were .62 and .72, 
respectively. Those institutions that passively permit some use of student 
evaluations without offering individual faculty members a statistical 
analysis of their ratings with respect to those of other members 
of the department, school, or institution, may in actuality be promoting 
a false sense of satisfaction and security among faculty, since individual 
faculty members may not be aware of the students' tendency 
to report above average ratings. Obviously the reliability and differential 
validity of student evaluations would be improved if techniques 
were used to encourage students to rate realistically the relative effectiveness 
of their teachers on a true four point scale. 

The median correlation between beginning and end of semester rat- 
ings was shown to be +.60 with individual item correlations ranging 
from -.11 for amount of material learned to +.86 for assessment of 
the course difficulty and the teacher's perceived personality. The 
results of this portion of the study demonstrated that students were 
able to form relatively lasting appraisals of their course and instructor 
after minimal exposure. The stability of the ratings listed in Table 3 
met logical expectations. Although all characteristics of a course can 
be misjudged, those particular characteristics that would be antic- 
ipated to require maximum exposure in order to make a realistic 
judgment, indeed, showed the lowest pretest-posttest correlations 
(Learned —.11; Tolerance to disagreement, .18; Intellectual stimula- 
tion, .20). 

The final portion of this study was designed to determine whether or 
not students in experimental and control groups would give significantly 
different ratings to teachers who had altered their teaching 
style in two introductory psychology courses. Information in Table 3 
shows that indeed students rated the two styles of teaching differently. 
The life-oriented, interest arousing approach did receive significantly 
higher mean ratings, t = 2.83 and 4.97 respectively, 
than did the noninterest arousing, basic science/historical approach. 
Nearly all individual ratings were more negative for the rigid noninterest 
approach in both experimental groups. There were no significant 
differences in the mean ratings at the end of the semester between 
the experimental and control groups for each instructor. 

That the correlations between the mean item ratings in the experi- 


PROTESTANT ETHIC ATTITUDES AMONG 
COLLEGE STUDENTS 


L. K. WATERS, NICK BATLIS, AND CARRIE WHERRY WATERS 
Ohio University 


The six scales of the Survey of Work Values (Wollack, Goodale, 
Wijting, and Smith, 1971), the Blood (1969) pro-Protestant Ethic 
scale, and the Protestant Ethic scale of Mirels and Garrett (1971) 
were intercorrelated, and each scale was correlated with Rotter's I/E 
scale, SAT total score, and cumulative grade point average for 170 
college students. A factor analysis of the Protestant Ethic scales 
yielded two factors which were interpreted, on the basis of the loadings 
of the Survey of Work Values scales, as representing intrinsic 
(work-related) and extrinsic (reward-related) aspects of the Protestant 
Ethic. The Blood and the Mirels and Garrett scales loaded substantially 
on both factors. Generally, the Protestant Ethic scales 
were negatively related to external orientation on the I/E scale, and 
were unrelated to SAT scores and academic performance. 


RELATIVE to the Protestant Ethic, Wollack, Goodale, Wijting, and 
Smith (1971) have recently stated: 

The principal aspects of the Protestant Ethic as described by 
Weber (1958) are individualism, asceticism, and industriousness. 
The emphasis placed on a man’s industriousness probably represents 
the most critical aspect of Protestant Ethic. The Ethic has 
been assessed typically by indirect methods that have been 
presumed to index this concept... 

Correlations between attitudes and supposedly logically related 
behaviors have usually been found to be low. Frequently, a variety 
of considerations intervene to inhibit the behavioral manifestations 
of an attitude. Economic and social factors may greatly limit the 
alternative behaviors available to an individual regardless of his attitudes. 
It would, therefore, be naive to expect indirect measures to 
index accurately an individual’s work values. Attitude scales 
should provide more direct measures of these concepts [p. ...]. 

Within the past few years three scales have been developed to 
measure Protestant Ethic attitudes, and validity data have been 
reported for each of the scales (Blood, 1969; Mirels and Garrett, 1971; 
Wollack, Goodale, Wijting, and Smith, 1971). 

The purpose of the present study was twofold: (a) to determine, in 
a college sample, the interrelationships among the three Protestant 
Ethic scales, and (b) to examine the relationships of each of the three 
scales to selected personality, ability, and academic performance 
measures. 


Method 


All scales were administered in booklet form to 102 males and 
females enrolled in introductory level psychology classes. Items from 
the three Protestant Ethic scales were combined into one section of the 
booklet, and each item was responded to on a 7-point agree/disagree 
scale. 

One Protestant Ethic scale (PE-B) consisted of the four pro-Protestant 
Ethic items used by Blood (1969) in a study of work values and 
job satisfaction among military personnel. The second Protestant 
Ethic scale (PE-MG) was that developed by Mirels and Garrett (1971) 
with samples of college students. This scale consists of 19 items. Each 
of these scales yields a single overall score. The third Protestant Ethic 
scale was the Survey of Work Values (SWV) developed by Wollack, et 
al. (1971). The SWV has 54 items with nine items covering each of the 
six areas: Activity Preference (AP), Job Involvement (JI), Pride in 
Work (PW), Social Status of Job (SS), Upward Striving (US), and Attitude 
toward Earnings (AE). The first three subscales of SWV represent 
three dimensions of Protestant Ethic that cover intrinsic aspects 
of work, whereas the latter three subscales are intended to reflect extrinsic 
aspects of the Protestant Ethic. 

The personality measure included in the booklet was Rotter's (19 
internal/external control of reinforcement scale (I/E). The two other 
indices were taken from student records: the total score of the 
Scholastic Aptitude Test (SAT) of the College Entrance Examination 
Board, as a measure of ability, and cumulative grade point average 
(GPA) as a measure of academic performance. 


Results and Discussion 


The correlations of the Protestant Ethic scales with each other and 
with the personality, ability, and academic performance measures are 
given in Table 1. Coefficient alpha reliability estimates are presented 
in the diagonals for the Protestant Ethic scales. 

The extrinsic subscales of the SWV had internal consistency 
reliability estimates similar to and the intrinsic subscales had reliability 
estimates somewhat higher than those reported by Wollack, et al. 
(1971) for industrial and government workers. The PE-MG scale's 
reliability was almost identical to that reported by Mirels and Garrett 
(1971) for the college students. Blood (1969) did not report a reliability 
estimate for his pro-Protestant Ethic scale, but the obtained .71 coefficient 
seems quite sufficient for research purposes (and quite high 
considering the scale has only four items). 
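
For readers who want to see how a coefficient alpha estimate of this kind is obtained, a minimal sketch follows. It uses the standard formula, alpha = [k / (k - 1)] [1 - (sum of item variances) / (variance of total scores)], where k is the number of items. The responses are hypothetical 7-point ratings on a four-item scale, not the PE-B data, and the function name is illustrative.

```python
from statistics import variance


def coefficient_alpha(responses):
    """Coefficient alpha for a list of respondents, each a list of k item scores."""
    k = len(responses[0])
    # Variance of each item across respondents.
    item_vars = [variance([resp[i] for resp in responses]) for i in range(k)]
    # Variance of the total (summed) scale score.
    total_var = variance([sum(resp) for resp in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of six people to a four-item, 7-point scale.
responses = [
    [6, 5, 6, 7],
    [3, 4, 2, 3],
    [5, 5, 4, 6],
    [2, 3, 3, 1],
    [6, 7, 5, 6],
    [4, 3, 5, 4],
]

alpha = coefficient_alpha(responses)
print(round(alpha, 2))
```

With the average inter-item correlation held constant, alpha rises with the number of items, which is why a coefficient of .71 from only four items is notable.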

A principal components factor analysis of the correlations among 
the Protestant Ethic scales (using 1.00 as the eigenvalue cutoff and a 
Varimax rotation) yielded two factors which were consistent with the 
two-factor intrinsic (work-related) and extrinsic (reward-related) 
dichotomy. On the first factor, loadings of .85, .77, and .89 were obtained 
for the SWV intrinsic subscales AP, JI, and PW, respectively. 
The SWV extrinsic subscales SS (.77), US (.70), and AE (.80) defined 
the second factor. Both the Blood and the Mirels and Garrett scales 
loaded in the middle to high .50's on both factors. 

The intrinsic subscales of the SWV, the PE-B, and the PE-MG 
scales all correlated significantly (p < .01) and negatively with the I/E 
scale, an outcome indicating that persons scoring high on these PE 
scales tended to perceive their own efforts and abilities, rather than 
luck or fate, as determining the course of events in their lives. This 
result is consistent with the findings of Mirels and Garrett (1971) for 
college students. 


TABLE 1 
Correlations of Protestant Ethic Scales with Each Other and Measures of Personality, 
Ability, and Academic Performance 


Measures AP JI PW SS US AE PE-B PE-MG I/E SAT GPA 


АР т) 52* mE ET SNe ae ОЗ: 120 
i it 69) ШУ ИОНОВ О зө Sate aE 100 -% 
Ew Gey LA 368. 1800491, САТ =e 0. |! 
55 о Mm 15 -0 
ДЕ (Өмен и.о 
АЕ (СУЙЫК О 2 72037 2, 716 
РЕ-В (іу) 70 -5* 03 -0 
PE-MG б) -34 0-12 

45.64 48. ов о ШЕЛІ 11,19, 8109. 129-9409 2, 
ЗВ 2216 UN ево АЛТ BO 44418262171 


Decimal points omitted from the correlation coefficients. 
The names and descriptions of the measures are presented in the text. 

With the exception of the SWV extrinsic subscales, the PE scales 
were almost completely unrelated to ability as measured by SAT total 
score. Although all three SWV extrinsic subscales were negatively 
related to SAT, only AE correlated significantly (p < .01). None of the 
PE scales was significantly related to academic performance indexed in 
terms of cumulative GPA (with or without SAT total partialed out). 
Mirels and Garrett (1971) found, in terms of interest patterns on the 
Strong Vocational Interest Blank, that PE attitudes were positively 
related to occupations requiring a concrete, pragmatic approach to 
work and negatively to SVIB scales for occupations which to a greater 
degree require emotional sensitivity, theoretical interests, and 
humanistic values. With the wide range of contemplated majors in a 
relatively unselected group of freshman and sophomore students, it 
seems reasonable that Protestant Ethic scales did not correlate with 
GPA. Perhaps PE scales would show differential correlations with per- 
formance for specific majors. 
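
The phrase "with SAT total partialed out" refers to a first-order partial correlation. As a sketch under stated assumptions, the standard formula r_xy.z = (r_xy - r_xz r_yz) / sqrt[(1 - r_xz^2)(1 - r_yz^2)] is computed below, with x a PE scale, y GPA, and z SAT total. The input correlations are hypothetical values chosen for illustration, not entries from Table 1.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical correlations: PE scale with GPA, PE scale with SAT, GPA with SAT.
r_pe_gpa = 0.10
r_pe_sat = -0.20
r_gpa_sat = 0.40

print(round(partial_r(r_pe_gpa, r_pe_sat, r_gpa_sat), 3))
```

When a scale is nearly uncorrelated with SAT, partialing SAT out changes its correlation with GPA very little, which is consistent with the report that the results held with or without SAT partialed out.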
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PREDICTIVE VALIDITY OF THE AMERICAN UNIVERSITY 
OF BEIRUT TRIAL APTITUDE BATTERY 


Е. К. ABU-SAYF AND GEORGE I. ZA'ROUR 
American University of Beirut 


On the basis of student scores on a battery of tests given at the time 
of their admission to the American University of Beirut and their 
subsequent college grades, the validity of the three new tests for 
predicting college grades was obtained. The results in general sug- 
gested a low to moderate predictive validity for each of the tests. Sug- 
gestions for test revision were based on item analysis and on classifi- 
cation of the items in terms of both the factors of science aptitude 
and the subject matter each item represents. 


IN the fall of 1971, two essentially unspeeded 50-item, 5-choice 
forms of each of three new tests were administered to newly enrolled 
students at the American University of Beirut (AUB). These tests 
were: the English Proficiency Test (EP), the Quantitative-Aptitude 
Test (AQ), and the Science-Aptitude Test (AS). The purpose of this investigation 
was to find the predictive validity of each of the tests and of 
weighted combinations of them with grade-point average (GPA) and 
performance in selected courses as criteria. Special emphasis was given 
to AS in view of its particular relevance to science curricula that are 
emphasized at AUB. 

Methodology 

For the most part, zero-order and multiple correlation coefficients 
were calculated. The experimental sample consisted of 275 freshmen 
and sophomores (except as otherwise noted). 

Results 

The Test Battery in General 


The intercorrelations among scores on the three tests showed that 
AS overlapped with AQ to a relatively large extent and with EP to a
smaller extent, whereas the overlap between EP and AQ was only
moderately high. These values, like all the correlation coefficients com- 
puted in this project, were not corrected for restriction of range. 

Product-moment correlation coefficients between scores on these 
tests and selected performance criteria suggested that the prediction of 
first-semester GPA was generally low to moderate, although the coeffi- 
cients of correlation were generally significantly different from zero. 
AS predicted grades in Biology 201 more accurately than grades
in other science courses; AQ was a valid predictor of grades in 
mathematics courses; and EP predicted grades in English 201 at a 
higher level of accuracy than in other English courses. 

Multiple correlation coefficients between possible combinations of 
these tests and first-semester freshman and sophomore GPA are 
presented in Table 1. These values were not appreciably higher than 
were the zero-order correlations of single tests with the same criterion. 
When first-year GPA was used as the criterion (instead of first- 
semester GPA), the data shown in Table 2 were obtained. Since cross- 
validation data were not available, shrinkage in these values would be 
expected to occur, especially where the numbers of cases were small. 
The underlined values were recommended as most efficient for prac- 
tical considerations. It is also interesting to note the appreciable drop 
in the coefficients whenever EP was absent from a combination.
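The expected shrinkage can be illustrated with the common adjusted R-squared correction (often attributed to Wherry). This is only a sketch: the source does not say which shrinkage estimate, if any, the authors had in mind, and the function name `adjusted_r` is hypothetical. The inputs are the Table 1 values for freshmen with tests 1 and 2 (R = .35, N = 73) and for sophomores with all three tests (R = .38, N = 202).

```python
import math

def adjusted_r(r_multiple: float, n_cases: int, n_predictors: int) -> float:
    """Shrunken multiple R from adjusted R^2 = 1 - (1 - R^2)(N - 1)/(N - k - 1)."""
    r2 = r_multiple ** 2
    adj_r2 = 1.0 - (1.0 - r2) * (n_cases - 1) / (n_cases - n_predictors - 1)
    # Adjusted R^2 can fall below zero for weak fits; report 0 in that case.
    return math.sqrt(max(adj_r2, 0.0))

# Table 1 values: freshmen (tests 1,2) and sophomores (tests 1,2,3).
freshmen = adjusted_r(0.35, 73, 2)
sophomores = adjusted_r(0.38, 202, 3)

print(f"Freshmen:   R = .35 shrinks to about {freshmen:.3f}")
print(f"Sophomores: R = .38 shrinks to about {sophomores:.3f}")
```

The smaller freshman sample loses more than the larger sophomore sample, which matches the remark that shrinkage matters most where the numbers of cases are small.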

The point-biserial correlation coefficients obtained from item
analysis of each test using first-semester GPA as criterion were low.
Consequently, the minimum value for an item to be left unchanged 
was taken to be 0.15. 


TABLE 1
Multiple Correlation Coefficients between Either First-Semester Freshman
or First-Semester Sophomore GPA as Criterion Variables and
Combinations of Scores on EP(1), AQ(2), and AS(3)
as Predictor Variables

Predictor
Variables    Subjects      N      R*
1,2          Freshmen      73     .35
             Sophomores    202    .37
1,3          Freshmen      73     .33
             Sophomores    202    .33
2,3          Freshmen      73     .28
             Sophomores    202    .30
1,2,3        Freshmen      73     .35
             Sophomores    202    .38

* All coefficients significant beyond the .01 level.


TABLE 2
Correlation Coefficients between First-Year GPA(1) and EP(2), AQ(3), and AS(4)
and Combinations of These Predictor Variables

Correlation    Arts           Bus. Adm.      Science        Total Sample
Coefficient    (N = 75)       (N = 24)       (N = 174)      (N = 271)
r12            [illegible]    .50*           .56**          [illegible]
r13            .05            .38            .56**          [illegible]
r14            .08            [illegible]    .46**          [illegible]
R1.23          .58**          .53            .66**          [illegible]
R1.24          .58**          .51            .60**          [illegible]
R1.34          [illegible]    .43            .62**          [illegible]
R1.234         .59**          .57            [illegible]    .58**

* p < .05.
** p < .01.

Note. The underlined values denote recommended test combinations.


The Science-Aptitude Test in Particular


Based on the Kuder-Richardson formula (21), the coefficients of in- 
ternal consistency of the two forms of AS, coded 885 and 886, were 
found to be .81 and .80, respectively. Form 886 proved to be a slightly 
more valid predictor of GPA than was Form 885. Critical ratios 
between means suggested that science students did perform signifi- 
cantly better than did arts students (p < .01). Form 886 was more diffi- 
cult than Form 885 (p < .01). Ten items out of 50 were found by in- 
spection to be equivalent in both forms, whereas four pairs were 
equivalent within Form 886.

Content validity. The items in AS were classified in eight categories
judged by previous investigators to define science aptitude: (1) tend-
ency to suspend judgment when evidence is insufficient; (2) ability and
accuracy in designing and defining; (3) creativity; (4) previous scientific
experience; (5) an experimental bent; (6) specialized curiosity; (7) ac-
curacy in reasoning and interpreting data as in the ability to evaluate
and detect inconsistencies; and (8) mechanical ability. The items were
then further classified under chemistry, physics, biology, physical
chemistry and chemical physics (nuclear and atomic chemistry and
physics), or general science (astronomy and space and earth sciences).
As a result of this work, the items in each form of AS were arranged in
a presumed order of decreasing validity. This ordering was used in
revising the tests.

Comments regarding the potential validity of AS. The maximum cor-
relation that could be attained between scores on any predictor test (t)
and scores on any criterion (c) would be $\sqrt{r_{tt} r_{cc}}$, in which $r_{tt}$ is the
estimated reliability of the test and $r_{cc}$ is the estimated reliability of the
criterion variable. If the test yields perfectly reliable scores and if the
criterion variable (in this case GPA) is expressed as having an average
estimated reliability coefficient of about .81, the maximum validity
coefficient that could be obtained is .90. This value is the upper limit
for the predictive validity of Test AS which would be attained if this
test were perfectly reliable and measured all the true variance of GPA.
In practice, the validity of the test could most readily be increased by
selecting for use, from item analysis procedures, those items which
were found to be correlated highest with the external criterion variable
and lowest with total test scores (Gulliksen, 1950, p. 380 ff.).
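The ceiling described above can be checked numerically. The sketch below applies the square root of the product of the two reliabilities; the function name `max_validity` is hypothetical, and the second call, which plugs in the K-R 21 value of .80 reported for Form 886, is an illustration added here rather than a figure from the article.

```python
import math

def max_validity(test_reliability: float, criterion_reliability: float) -> float:
    """Upper bound on a validity coefficient: sqrt(r_tt * r_cc)."""
    return math.sqrt(test_reliability * criterion_reliability)

# Case stated in the text: perfectly reliable test, GPA reliability of about .81.
ceiling = max_validity(1.00, 0.81)

# Illustrative case (an assumption, not a reported result): test reliability
# of .80, the K-R 21 estimate given for Form 886 of AS.
ceiling_as = max_validity(0.80, 0.81)

print(f"Ceiling with a perfectly reliable test: {ceiling:.2f}")
print(f"Ceiling with test reliability of .80:   {ceiling_as:.3f}")
```

With an imperfectly reliable test the ceiling drops below .90, so the low to moderate coefficients reported earlier should be read against that lower bound.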


VALIDITY OF AWARDING COLLEGE CREDIT BY
EXAMINATION IN MATHEMATICS AND ENGLISH


CAROL KEHR TITTLE 
Queens College, City University of New York 


MAX WEINER 
Graduate School and University Center, City University of New York 


FRED D. PHELPS 
Lehman College, City University of New York 


The present study was concerned with the validity of the College-
Level Examination Program General Examinations (CLEP) in
Mathematics and English Composition. The two-fold purpose of the 
study was to provide an estimate of the number of credit hours likely 
to be earned if students took CLEP as well as to examine the inter- 
relationships among the two previously mentioned subtests from 
CLEP, an end-of-year achievement test in Mathematics, and a prior 
measure in English Composition. First-year students at a senior col-
lege of the City University of New York were recruited for an ex-
perimental administration of the CLEP, the final examination from
the first year mathematics course, and a college-developed English 
placement essay. Students with high scores on the American College 
Testing Program Examination (ACT) and high school averages were 
predominant in the sample selected for testing. From the data, in- 
ferences were made that (1) the CLEP Mathematics test could be 
used to grant credit in mathematics, but that the current cutting 
score should be examined in view of standards used in the course; (2) 
there was little relationship between CLEP English Composition 
scores and present college placement procedures for first-year 
English; and (3) the number of students who could earn college credit 
by examination was much higher than was the number presently taking
CLEP at the college.


THE use of examinations to grant credit for college level courses has
been argued for on the basis of benefits to students (either in ex-
panding or accelerating their educational experiences) and to institu-
tions (in terms of allocation of staff and possible reduction of costs).
The present report is the first one in a study of credit-by-examination
at the City University of New York (CUNY), with one of the senior
colleges agreeing to participate. Its dual purpose was to provide, for a
sample of college freshmen, an estimate of the number of credit hours
likely to be earned if students took the College-Level Examination
Program General Examinations (CLEP) and to ascertain the degree of
interrelationships of scores on each of two CLEP subtests
(Mathematics and English Composition), of performance on an end-
of-year achievement test in mathematics prepared by faculty members
(but administered at the start of the fall semester), and of standing on a
placement test (writing sample) in English administered three months
prior to fall enrollment in freshman classes.


Methodology 


Letters to recruit volunteers to take CLEP were sent to 240 students.
These letters were directed primarily to students in the upper part of
the composite score distribution of the American College Testing
Program Examination (ACT) and in the upper part of the distribution
on high school average (HSA). Of this original group of 240 students,
181 agreed to participate; complete data for analyses were available
for 171 students (69 males and 102 females). The ACT and HSA data
for the 171 students were: ACT score range from 14 to 32, mean =
23.8, standard deviation = 3.46; HSA range from 73 to 96, mean =
85.6, and standard deviation = 5.61. After being tested September 8
and 9, 1973, students were granted college course credit on any CLEP
test on which they had placed at or above cutting scores the college
had established.

The tests administered were: CLEP General Examinations (5 [...])
in English Composition, Natural Sciences, Mathematics, Humanities,
and Social Sciences-History; and a faculty constructed final examina-
tion for the college first year mathematics course (FME). English
writing samples were available from a June 1973 placement [...]
devised at the college for first-year English (placement in English 101
or 102); these essays were rescored in the fall, 1973, and given a grade
of A, B, C or D by one of three raters.

The CLEP and FME were given on 2 days, with 2 testing orders ran-
domly assigned (balanced for the FME): Group A took FME, then
CLEP Books I and II; Group B took CLEP Book I (English Composi-
tion, Natural Sciences, Mathematics), FME, then CLEP Book II
(Humanities and Social Sciences-History). Examination of scores as-
sociated with test order (A, B) and date of testing (Saturday, Sunday)
by analysis of variance revealed no significant main order or interac-
tion effects for CLEP English Composition and Mathematics scores,
or for the FME scores. These data from groups A and B and from two
testing dates were combined for the remaining analyses.


Results and Discussion 
Mathematics 


Correlations of the FME raw scores and grades (ranging from A to 
F) with CLEP Mathematics scores were .62 and .58, respectively. Cor- 
rection for restriction in range, which was based on the use of ACT 
scores for the full first-year class of 1972 (mean of 16.5 and standard 
deviation of 5.35), resulted in only slightly higher correlation coeffi- 
cients of .64 and .62, respectively. The recommended cutting score on 
CLEP Mathematics (495) was contrasted with grades of C or higher
on the FME. This analysis showed that all those earning 0 credit on
CLEP received a mark of D or lower on FME (N = 51). There were no
students who obtained a grade of C or higher on FME and also earned 
zero CLEP credit hours. Of those receiving 6 credit hours in 
mathematics on the basis of CLEP, 85 (71%) received a grade of D or 
lower on FME, and 35 (29%) earned a grade of C or higher.

Although the correlation of performance on the CLEP and FME
was within the range of expected validity coefficients, the cutting score
might need to be adjusted for the course standard. Examination of the
scatterplot of FME grades and CLEP Mathematics showed that a
grade of F was associated with a CLEP Mathematics score range from
361 to 604. Adjusting the cutting scores to reduce the proportion of
those receiving CLEP credit and earning grades of D or lower would
increase the numbers of students receiving C or higher grades who also
earned zero CLEP credit. Example cutting scores are shown in
Table 1. Increasing the cutting score required for receiving credit could
effect a considerable reduction in the numbers receiving grades of D or
lower as compared with the frequency associated with the cutting
score employed. The number of students who would have received a
grade of C or higher on the basis of the FME but would not have
received 6 credits on CLEP went from zero with the present cutting
score of 495, to 3, 7, and 9 with use of higher cutting scores of 527, 543,
and 559, respectively, on CLEP. Cutting scores would need to be es-
tablished locally (and cross-validated) where grades on college final ex-
aminations for courses indicate a different standard from that es-
tablished by use of CLEP recommended cutting scores.


TABLE 1
A Comparison of Number of Students with Zero or Six CLEP Credit Hours in
Mathematics and with Faculty-Assigned Grades of C or Higher and D or Lower as
CLEP Mathematics Cutting Scores Are Increased

                    Students receiving zero         Students receiving six
                    CLEP credit hours               CLEP credit hours
CLEP Mathematics    Number with grade    Total      Number with grade    Total
Cutting Score       of C or higher       number     of D or lower        number
495 or higher       0 (0%)               51         85 (71%)             120
527 or higher       3 (4%)               83         56 (64%)             88
543 or higher       7 (7%)               98         45 (62%)             73
559 or higher       9 (8%)               115        30 (54%)             56
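The trade-off in Table 1 can be re-derived from the counts alone. The sketch below is a consistency check added here, not part of the original analysis: it recomputes each percentage and confirms that the zero-credit and six-credit totals at every cutting score add up to the 171 students tested. The six-credit total at the 543 cutting score is read as 73, the only value consistent with 171 students and the reported 62%.

```python
TOTAL_STUDENTS = 171

# cutting score -> (C or higher among zero-credit, zero-credit total,
#                   D or lower among six-credit, six-credit total)
table_1 = {
    495: (0, 51, 85, 120),
    527: (3, 83, 56, 88),
    543: (7, 98, 45, 73),
    559: (9, 115, 30, 56),
}

def pct(part: int, whole: int) -> int:
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)

rows = []
for cut, (c_up, zero_total, d_down, six_total) in table_1.items():
    # Every student either earns zero or six credit hours at a given cut.
    totals_ok = (zero_total + six_total == TOTAL_STUDENTS)
    rows.append((cut, pct(c_up, zero_total), pct(d_down, six_total), totals_ok))

for cut, pct_c, pct_d, ok in rows:
    print(f"cut {cut}: {pct_c}% C-or-higher denied credit, "
          f"{pct_d}% D-or-lower given credit, totals consistent: {ok}")
```

Raising the cutting score lowers the share of credited students with a D or lower while raising the share of uncredited students with a C or higher, which is the tension the paragraph describes.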
English 


The correlation between CLEP English Composition and grades on 
the English placement essay was .24 (.26, when corrected for restric-
tion in range). This figure was consistent with the contingency coeffi-
cient of .22 computed for the four-fold table (receipt of either 0 or 6
hours CLEP credit vs. placement in either English 101 or English 102), 
the entries for which were based on readings of the essay the prior June 
(chi square = 8.35, p = .004). These correlational results were lower 
than would ordinarily be useful for a study of the validity of granting 
credits based on examination. 
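The reported contingency coefficient follows from the chi-square value. A minimal sketch, assuming the four-fold table covered all 171 students with complete data (the article does not state the N for this table) and using Pearson's formula C = sqrt(chi-square / (chi-square + N)); the function name is hypothetical.

```python
import math

def contingency_coefficient(chi_square: float, n: int) -> float:
    """Pearson's contingency coefficient C = sqrt(chi2 / (chi2 + N))."""
    return math.sqrt(chi_square / (chi_square + n))

# chi square = 8.35 is from the text; N = 171 is an assumption.
c = contingency_coefficient(8.35, 171)
print(f"C = {c:.2f}")
```

Under that assumption the result rounds to the .22 given in the text.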


Course Credits 


The number of students receiving 0 or 6 course credits for each of
the 5 CLEP General Examinations was computed. Six hours of course
credits were given students on the basis of cutting scores as follows:
English Composition, 495; Natural Sciences, 485; Mathematics, 495;
Humanities, 468; Social Sciences-History, 470. The percentage of
students receiving credit ranged from 46 in English Composition to 70
in Mathematics; 51% received 6 hours of credit in Humanities, 53% in
Social Sciences-History, and 58% in Natural Sciences. Only 14 stu-
dents (8%) of the 171 students failed to earn any course credit. Sixteen
per cent (28 students) earned the maximum possible 30 credits,
equivalent to a full year's credit. The rest of the distribution was: 29
students (17%) earned 6 hours of credit; 26 (15%), 12 hours of credit; 42
(25%), 18 hours of credit; and 32 (19%), 24 credit hours by examina-
tion.

The number of students voluntarily taking CLEP prior to entering
the college (i.e., not recruited for this study) was 40. The results of the
study clearly showed that many more students were capable of taking
and receiving course credit on the basis of CLEP. The implications for
the college enrollment were estimated by computing full-time
equivalent students (FTE) on the basis of credits earned by these 171
students. The total of 2850 credits (divided by 30 credits per student to 
obtain the number of FTE's), yielded the equivalent of 95 FTE stu- 
dents. Using a cost of $1710 per FTE (instructional costs only), sav- 
ings for 95 FTE students would be $162,450. Projected for ten senior 
colleges of CUNY, savings would be $1,624,500 annually. As gross es- 
timates, these figures do not take into account a full cost analysis that 
would be required if large numbers of students were to earn credit by 
examination. 
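The savings estimate is plain arithmetic and can be reproduced from the credit distribution. In the sketch below the counts for 6 and 18 credit hours are taken as 29 and 42; these are the only values consistent with the reported 17% and 25% of 171 students and with the 2850-credit total. Variable names are illustrative.

```python
# credit hours earned -> number of students
credit_distribution = {0: 14, 6: 29, 12: 26, 18: 42, 24: 32, 30: 28}

CREDITS_PER_FTE = 30      # one full year of credit
COST_PER_FTE = 1710       # dollars, instructional costs only (from the text)
SENIOR_COLLEGES = 10      # projection used in the text

students = sum(credit_distribution.values())
total_credits = sum(hours * n for hours, n in credit_distribution.items())
fte_students = total_credits / CREDITS_PER_FTE
savings_one_college = fte_students * COST_PER_FTE
savings_all_colleges = savings_one_college * SENIOR_COLLEGES

print(f"{students} students earned {total_credits} credits")
print(f"FTE equivalent: {fte_students:.0f}")
print(f"Savings, one college:  ${savings_one_college:,.0f}")
print(f"Savings, ten colleges: ${savings_all_colleges:,.0f}")
```

The projection to ten colleges simply multiplies the single-college figure, so it assumes each college would show the same credit distribution as this selected sample.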


Conclusions 


The data reported in this investigation have indicated that studies of 
the validity of CLEP for granting course credit by examination should 
be carried out on an individual institution basis. Satisfactory validity 
was obtained for CLEP Mathematics, when a college course final ex- 
amination in Mathematics was used as the criterion. However, the cur- 
rently recommended cutting score might be too low, when examined 
against faculty standards. Similar validity was not demonstrated for 
CLEP English Composition, when a college English placement essay 
was employed as the criterion measure. The well-known difficulty in 
obtaining reader reliability for English essays undoubtedly con-
tributed to the low correlations reported.

Comparison of the numbers of students voluntarily taking CLEP 
with the results of credits granted by examination in this study in- 
dicated that there would be a large potential group of students who 
could take the CLEP and earn college credit. This study did not at-
tempt to anticipate long-range effects of encouraging credit by ex-
amination. Two of the many possible outcomes of granting college
credit on the basis of CLEP scores would include shorter time in col-
lege and an alteration of the ratio of beginning to advanced courses
within academic departments. These and other results should be ex-
amined in further testing of the validity of a program such as CLEP.


PREDICTION OF FIRST QUARTER FRESHMAN GPA
USING SAT SCORES AND HIGH SCHOOL GRADES


BRAD S. CHISSOM AND DORIS LANIER
Georgia Southern College 


The study attempted to determine the validity of students’ SAT 
scores and HSGPA as predictors of freshman course grades and 
overall college grade point average (CGPA). Subjects for the study 
included 669 freshman students at Georgia Southern College who 
had enrolled in and completed either English Composition I or fresh-
man mathematics during fall quarter, 1973. Data for the study in-
cluded students’ SAT-V scores, SAT-M scores, HSGPA and CGPA.
Results showed that a significant multiple correlation existed be- 
tween the predictor variables and CGPA. 


Many colleges and universities have adopted the College Entrance 
Examination Board’s Scholastic Aptitude Test (SAT) as one criterion 
for selecting candidates for college admission. In many instances a stu-
dent's SAT score, in combination with his high school grade point
average (HSGPA), is the basis for his acceptance or rejection by the
school of his choice. Since college admissions officers place a great deal
of emphasis on a student's HSGPA and SAT score, continuous efforts
should be made to determine the predictive validity of these scores.

Most researchers have found a correlation between first quarter or
first year grades of college students and (a) high school grade point
average (HSGPA) and (b) SAT scores. For example, Franz, Davis,
and Garcia (1958) obtained a substantial correlation between first
quarter grade averages of students and each of two cognitive predic-
tors: HSGPA and SAT scores. Several investigators have concluded
that HSA was a more valid predictor of college success than were SAT
scores (Franz, Davis, and Garcia, 1958; Mann, 1961; Michael and
Jones, 1963; Spaulding, 1959). Among the studies concerned with
grade predictions in specific subject areas, Passons (1967) concluded
that, although high school achievement was the most predictive in-
dicator of future over-all college success, test scores were slightly more
valid than was high school standing for predicting grades in specific
courses. In another investigation Brown and Lightsey (1970) obtained
correlations of .54 and .50 for men and women, respectively, between
SAT verbal (SAT-V) scores and English course grades. More re-
cently, Lanier and Lightsey (1972) found a correlation of .74 between
SAT-V scores and English grades and of .66 between HSGPA and col-
lege English grades. In conclusion, most correlational studies have re-
vealed a substantial relationship between cognitive predictor variables
and college achievement.


Purpose and Method


The purpose of this study was to determine the validity of students’
SAT scores and HSGPA as predictors of freshman course grades and
overall college grade point average (CGPA) at Georgia Southern Col-
lege. It was hypothesized that a positive correlation would exist be-
tween (1) SAT verbal scores (SAT-V) and CGPA, (2) SAT mathe-
matics scores (SAT-M) and CGPA and (3) HSGPA and CGPA. The
subjects for the study included all freshman students (N = 669) who
had enrolled in and completed either English Composition I or fresh-
man mathematics during fall quarter, 1973, at Georgia Southern Col-
lege. The following data were obtained from each student's record: (1)
SAT-V scores, (2) SAT-M scores, (3) HSGPA, and (4) CGPA at the
end of the fall quarter. Grade point average was based on a 4-point
scale.


Results and Discussion 


Intercorrelations among the four variables along with means and
standard deviations are included in Table 1. All three predictors cor-
related significantly with the CGPA criterion measure.


TABLE 1
Intercorrelations of Predictor and Criterion Variables Along with
Their Means and Standard Deviations

Variables      1      2       3              4
1. SAT-V       -      .49*    [illegible]    [illegible]
2. SAT-M              -       [illegible]    [illegible]
3. HSA                        -              .46*
4. CGPA                                      -

* Significant at the .01 level (R > [illegible], p < .01).


TABLE 2
Summary Table for Step-Wise Multiple Regression

Variable       Multiple                 Increase
Entered        R              R²        in R²
1. HSGPA       [illegible]    .20       .20
2. SAT-M       .55            .30       .10
3. SAT-V       .57*           .32       .02

* F = 108.244 (df = 3,665), p < .01.


Results of the step-wise multiple regression analysis are presented in 
Table 2. The numbers of the variables indicate the order in which they 
entered the prediction equation. The overall multiple correlation co- 
efficient of .57, which was statistically significant beyond the .01 level, 
was comparable with coefficients obtained from other studies using the 
same three types of predictor variables. The largest contribution to the 
relationship was made by HSGPA followed by the SAT-M and SAT- 
V variables. It would appear that SAT-M was weighted more heavily 
for predicting CGPA for this group of subjects than was the SAT-V, 
but that the weighted combination of the three variables indicated 
limited validity for prediction of CGPA from either the SAT-V or the 
SAT-M. It was evident that HSGPA was the most valid predictor. 
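The overall F in Table 2 is tied to R-squared by the usual regression identity F = (R²/k) / ((1 − R²)/(N − k − 1)). The sketch below is an illustration added here, with hypothetical function names: it computes F from the rounded R² of .32 and then inverts the identity to recover the unrounded R² implied by the reported F of 108.244.

```python
def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F for a regression with k predictors and n cases."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

def r2_from_f(f: float, n: int, k: int) -> float:
    """Invert the identity above to recover R^2 from F."""
    return (k * f) / (k * f + (n - k - 1))

N_STUDENTS, N_PREDICTORS = 669, 3

f_rounded = f_from_r2(0.32, N_STUDENTS, N_PREDICTORS)       # from rounded R^2
implied_r2 = r2_from_f(108.244, N_STUDENTS, N_PREDICTORS)    # from reported F

print(f"F from R^2 = .32: {f_rounded:.1f}")
print(f"R^2 implied by F = 108.244: {implied_r2:.3f}")
```

The small gap between the two F values is what rounding R² to two decimals would produce, so the tabled F, degrees of freedom, and multiple R of .57 are mutually consistent.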


THE RELATIONSHIP BETWEEN ACADEMIC APTITUDE
AND OCCUPATIONAL SUCCESS FOR A SAMPLE 
OF UNIVERSITY GRADUATES 


JOHN LEWIS 
Winona State College 


The academic aptitude at the time of admission to college and level 
of occupation later in life were collected for 619 male college 
graduates. The statistically significant relationship indicated that the 
graduates with higher aptitude scores as compared with those with 
lower scores were more likely to report higher level occupations. 


THIS study was undertaken to determine whether academic aptitude
at the time of admission to college was related to occupational success 
later in life among 619 male graduates. 


Procedures 


Each subject’s academic aptitude was defined as his composite score
on the Iowa Placement Tests. Designed to predict academic achieve-
ment in college, this battery of tests was given to all students who
entered the University of Iowa during the years prior to the develop-
ment of the American College Testing Program Examinations. Oc-
cupational level was determined by the subjects’ responses to question-
naires in the late 1960's that were mailed by the author and the Uni-
versity of Iowa Alumni Office. Roe's (1956) classification scheme was
used to determine the level of each graduate's occupation. The subjects
were 619 male graduates of the University of Iowa College of Liberal
Arts in the academic years 1948-49, 1954-55, and 1959-60 who had
majored in the areas of general humanities, social science, natural
science, or journalism and for whom complete data could be found.

Findings

The classification of the graduates by academic aptitude and later
occupational level is presented in Table 1. Larger percentages of the
graduates who had earned higher scores on the admissions tests as
compared with those who had obtained lower scores reported occupa-
tions with level 1 classification. Likewise, larger percentages of
graduates with low scores as compared with those who had received
high scores reported level 3 occupations. For example, 9% of the
graduates with admission test scores from the first to the forty-
ninth percentile range reported level 1 occupations whereas 24% of
those within the ninetieth to the ninety-ninth percentile range on the tests
reported level 1 occupations. The chi square value of 20.57 was
statistically significant beyond the .01 level.


TABLE 1
Ranges of Scores in Terms of Percentile Ranks on Freshman Test Battery
and Corresponding Occupational Levels by Percentages of Individuals
within These Test Score Ranges

Occupational       Percentile Ranks in Test Scores
Level           01-49     50-74     75-89     90-99
1               9         16        17        24
2               72        74        72        66
3               19        10        11        10
N               185       158       143       133

χ² = 20.57, df = 6, p < .01
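The chi square test can be approximated from Table 1 itself. The sketch below rebuilds cell frequencies from the rounded percentages and group sizes, so it is only a rough check and will not reproduce 20.57 exactly; the critical value 16.81 is the standard tabled chi square for df = 6 at the .01 level.

```python
# occupational level -> percentage within each percentile-rank group
percent_by_level = {
    1: [9, 16, 17, 24],
    2: [72, 74, 72, 66],
    3: [19, 10, 11, 10],
}
group_n = [185, 158, 143, 133]   # 01-49, 50-74, 75-89, 90-99

# Approximate observed frequencies (not integers, because the
# published percentages are rounded).
observed = [[p * n / 100 for p, n in zip(row, group_n)]
            for row in percent_by_level.values()]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

df = (len(observed) - 1) * (len(group_n) - 1)
CRITICAL_01_DF6 = 16.81

print(f"approximate chi square = {chi2:.1f}, df = {df}")
print("exceeds .01 critical value:", chi2 > CRITICAL_01_DF6)
```

The approximate value lands close to the reported 20.57 and well above the critical value, and most of it comes from the level 1 and level 3 cells in the lowest and highest score groups.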


Conclusion 


These results show that the ability to achieve high scores on a typical
college admissions test was related to occupational success among
those graduates who responded to the questionnaire when oc-
cupational success was defined as the prestige level of their occupation.


PREDICTING ACHIEVEMENT IN AN UPPER-DIVISION
BACHELOR’S DEGREE NURSING MAJOR


JOHN LEWIS AND MARGARET WELCH
Winona State College 


A study of the correlations between objective background 
variables and achievement in an upper-division bachelor's degree
program in nursing revealed significant correlations for grade point 
average in required college pre-nursing courses, grade point average 
in elective college pre-nursing courses, and rank in high school 
graduating class. The results of a multiple regression analysis showed 
that grade point average in required pre-nursing courses was the only 
variable to yield a significant regression weight. 


THE purpose of the study was to determine (1) what the degree of
relationship between selected objective background variables and
academic achievement in an upper-division bachelor's degree program
in nursing would be, (2) which of these background variables would
add to the efficiency of a regression function for predicting achieve-
ment in this nursing curriculum, and (3) what the accuracy of this op-
timum regression function would be.


Procedures


The subjects were 104 juniors and seniors in the bachelor's degree
nursing program at Winona State College in Winona, Minnesota. The
background variables selected for investigation were grade point
average in required college pre-nursing courses taken prior to formal
application to the nursing program, grade point average in elective
college courses taken prior to formal application to the nursing
program, composite standard score on the American College Testing
Program Examinations (ACT), rank in high school graduating class,
and number of elective college credits. The criterion was grade
point average in six core nursing courses typically taken as juniors.


TABLE 1
Correlations among Grades in Required Pre-Nursing Courses (GPAR), Grades in
Elective Pre-Nursing Courses (GPAE), Composite Standard Score on the ACT
Tests (ACT), Rank in High School Graduating Class (HSR), and Number of
Elective College Credits (NEC), and Grades in
Criterion of Core Nursing Courses (GPAC) N = 104

          GPAR      GPAE      ACT       HSR       NEC       GPAC
GPAR      1.00      .41**     .44**     .45**     .04       .44**
GPAE      .41**     1.00      .21*      .27**     -.19      .25*
ACT       .44**     .21*      1.00      .40**     .07       .12
HSR       .45**     .27**     .40**     1.00      -.12      .28**
NEC       .04       -.19      .07       -.12      1.00      .00
GPAC      .44**     .25*      .12       .28**     .00       1.00

* p < .05.
** p < .01.


Results


The correlations among all variables are presented in Table 1 and
the results of a multiple regression analysis are shown in Table 2.
Three objective background variables, grade point average in required
college pre-nursing courses, grade point average in elective college pre-
nursing courses, and rank in high school graduating class, yielded
statistically significant correlations with grades in the upper division
core nursing courses. The relatively large correlation of .44 between 
grades in the required pre-nursing courses and later success in the 
nursing program probably reflects a high degree of relevance between 
these required pre-nursing courses and the core courses in the nursing 
major. The remaining background variables of composite standard 
scores on the ACT tests and of total number of elective college credits 
did not correlate significantly with grades in the nursing major. 

Reference to the multiple regression data presented in Table 2 reveals
that grade point average in required pre-nursing courses was the only
background variable which yielded a statistically significant (p < .01)
regression weight. The other four predictor variables, which had non-
significant (p > .05) regression weights, added little to the accuracy of
a regression function. The single-order correlation between grade
point average in required pre-nursing courses and grades in the nursing
program was .44 compared with a multiple correlation of only .47
when all five of the variables were used as predictors.


TABLE 2
Raw Score Regression Weights (b), Standard Score Regression Weights (B)
and Multiple Correlations (R) for Each of the Background Variables

                                     b          B          R
GPA Required Pre-Nursing Courses     .425**     .409**     .44
High School Rank                     .348       .125       .45
American College Tests               -.019      -.126      .46
GPA Elective Pre-Nursing Courses     .069       .077       .47
Number of Elective Courses           .001       .021       .47

** p < .01.
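How little the four extra predictors add is easier to see in terms of variance accounted for. The short sketch below squares the two correlations reported in the text; the variable names are illustrative, and the calculation is an added illustration rather than one the authors report.

```python
r_single = 0.44   # required pre-nursing GPA alone
r_full = 0.47     # all five background variables together

var_single = r_single ** 2          # share of criterion variance, one predictor
var_full = r_full ** 2              # share of criterion variance, five predictors
gain = var_full - var_single        # extra variance from the other four

print(f"One predictor:   {var_single:.4f}")
print(f"Five predictors: {var_full:.4f}")
print(f"Gain from four additional predictors: {gain:.4f}")
```

The four additional variables raise the explained variance by under three percentage points before any shrinkage is considered.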


Summary 


This study was undertaken to determine an optimum method of
predicting academic achievement in an upper-division bachelor's
degree program in nursing. The findings of the study showed that
grade point average in required pre-nursing courses, grade point
average in elective pre-nursing courses, and rank in high school
graduating class correlated significantly with a criterion of achieve-
ment in this nursing program. However, a multiple regression analysis
of the data revealed that grade point average in required pre-nursing
courses was the only variable to yield a significant regression weight.
Optimum prediction could thus be made using this one background
variable as a predictor.


PREDICTIVE VALIDITY OF THE GRADUATE RECORD
EXAMINATION AND THE MILLER ANALOGIES TESTS


JOHN L. NAGI 
Hudson Valley Community College 


For a sample of 63 graduate students, 33 of whom did complete 
and 30 of whom did not complete a doctoral program in Edu- 
cational Administration at the State University of New York at Al- 
bany, statistically nonsignificant point biserial coefficients of 0.140 
and 0.087 were determined respectively for the total scores on the 
aptitude portion of the Graduate Record Examinations and scores
on the Miller Analogies Test relative to the dichotomous criterion of 
completion or lack of completion. 


THIS study was intended to determine the validity of the total score
on the Graduate Record Examinations Aptitude Test (GRE) and of
the score on the Miller Analogies Test (MAT) as predictors of a
criterion of completion or noncompletion of the Doctoral Program in
Educational Administration at State University of New York at
Albany.


Source of Data 


The data required to determine the validity of the GRE and MAT 
scores as predictors of program completion were gathered by examining 
the student records kept by the Graduate Admissions office of the 
School of Education of the State University of New York at Albany. 
This examination provided GRE and MAT scores for 33 students who 
had completed the Doctoral Program and 30 who had not. 


Results and Discussion 


To determine the degrees of relationship between GRE scores and a 
criterion variable of completion or non-completion of the program 
and the extent of association between MAT scores and the same 
criterion variable, point biserial coefficients of correlation were 
computed. The respective coefficients of 0.140 (N = 63) and of 0.087 
(N = 63) for the GRE and MAT predictor variables failed to reach 
statistical significance at the .05 level. 
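
As an illustration of the computation described above, the following sketch shows one common way to obtain a point biserial coefficient and to test a reported coefficient against zero. The function names, the tiny example data in the usage note, and the tabled critical value of about 2.00 for 61 degrees of freedom are assumptions introduced here; the article does not report its raw data or its exact procedure.

```python
import math


def point_biserial(scores, completed):
    """Point biserial r between a continuous score and a 0/1 criterion."""
    n = len(scores)
    group1 = [x for x, c in zip(scores, completed) if c == 1]
    group0 = [x for x, c in zip(scores, completed) if c == 0]
    mean1 = sum(group1) / len(group1)
    mean0 = sum(group0) / len(group0)
    mean_all = sum(scores) / n
    # Population standard deviation (divide by n), which pairs with sqrt(p * q).
    sd = math.sqrt(sum((x - mean_all) ** 2 for x in scores) / n)
    p = len(group1) / n
    q = len(group0) / n
    return (mean1 - mean0) / sd * math.sqrt(p * q)


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Coefficients and sample size as reported in the study.
t_gre = t_for_r(0.140, 63)
t_mat = t_for_r(0.087, 63)

# Assumed two-tailed .05 critical value for 61 df, from a standard t table.
CRITICAL_T = 2.00
both_nonsignificant = abs(t_gre) < CRITICAL_T and abs(t_mat) < CRITICAL_T
print(round(t_gre, 2), round(t_mat, 2), both_nonsignificant)
```

Calling `point_biserial([1, 2, 3, 4], [0, 0, 1, 1])` on made-up data gives the same value as an ordinary Pearson correlation between the two lists, which is the defining property of the coefficient. Both t statistics for the reported coefficients fall well short of the assumed critical value, consistent with the nonsignificant result stated in the text.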

Several prior studies have indicated that the usefulness of the MAT 
in predicting success in graduate school has not been noteworthy (Gill 
and Marascuilo, 1967; Hall and Robertson, 1964; Hyman, 1957; 
Platz, McClintock, and Katz, 1959). In a study at Utah State University, 
Borg (1963) determined that the GRE was of little value in predicting 
completion of a doctoral program and obtaining a degree. The 
present study appears to bear out earlier findings that the GRE and 
MAT are not substantially valid predictors of program completion. 
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THE PREDICTIVE VALIDITY OF THE WPPSI 
WITH ISRAELI CHILDREN¹ 


AMIA LIEBLICH AND MAYA SHINAR 


The Human Development Center 
The Hebrew University of Jerusalem, Israel 


Sixty-two first grade children in Israel were tested on the WPPSI, 
and most of them were retested a year later using objective measures 
of Reading and Arithmetic. The scores of the intelligence and 
achievement tests were correlated to provide estimates of the predic- 
tive validity of the WPPSI. Results revealed the Israeli WPPSI to be 
highly valid for prediction of school achievement. 


Since the publication of the WPPSI (Wechsler, 1967) several 
evaluation studies were carried out focussing on its relationship to 
other measures of intelligence (Yule, Berger, Butler, Newham, and 
Tizard, 1969; Zimmerman and Woo-sam, 1970) and on its predictive 
validity (Rankin and Henderson, 1973; Kaufman, 1973). The validity 
of the test as a predictor of early school achievement was rather satis- 
factory in the United States with middle-class children (Kaufman, 
1973), but nonsignificant with disadvantaged Mexican-American children 
(Rankin and Henderson, 1973). 

The purpose of the present study was to assess the predictive validity 
of the WPPSI in Israel, through use of a standardized Hebrew version 
of the test (Lieblich, 1971). 


Method 
Sample 


Sixty-two first grade children, their ages ranging from five and a half 
to six and a half years, participated in the first stage of the study. They 
¹ The authors wish to thank Dr. David Wechsler for his constant help and support, 
and Mrs. M. Bassok for carrying out the individual examinations. 
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were randomly sampled from the first grade pupils in a public school 
in an urban middle-class area. Fifty-four of these children were re- 
located eighteen months later for the second stage of the project. 


Instruments and Procedure 


In the first stage, the Hebrew WPPSI was administered individually 
during 1973. A year and a half later, 54 subjects who had been re- 
located in the second grade were given Israeli objective group tests of 
school achievement in reading and arithmetic (Ben Shachar and Ortar, 
1968; Minkowich, 1973). 


Results and Discussion 


The WPPSI and the achievement tests were scored and WPPSI 
scores were converted to scaled scores through using the appropriate 
age norms. Correlations of the scaled scores of the subtests, the IQ 
scores of the Verbal, Performance and Total scales were computed 
separately with each of the two achievement criteria. The coefficients 
appear in Table 1. 

The correlation between the reading and arithmetic tests was .67; 
between Verbal and Total IQs, .95; Performance and Total IQs, .51; 
and Verbal and Performance IQs, .74. That all the coefficients in Table 
1 were statistically significant (p < .05) indicated that even predictions 
of the criterion measures from the individual subtests were surpris- 
ingly valid, especially for the Arithmetic test. 


TABLE 1 
Correlations of WPPSI Scaled Scores of the Subtests and of Verbal, Performance, 
and Total IQs with Each of the Criterion Measures: Arithmetic and Reading Tests 


WPPSI subtests              Arithmetic     Reading 

Verbal 
  Information                  .59           .34 
  Vocabulary                   .43           .49 
  Arithmetic                   .54           .44 
  Similarities                 .54           .63 
  Comprehension                .45           .40 
Performance 
  Animal House              [illegible]   [illegible] 
  Picture Completion           .45           .44 
  Mazes                        .61           .44 
  Geometric Designs            .58           .51 
  Blocks                       .57           .26 
Verbal IQ                      .64           .57 
Performance IQ                 .73        [illegible] 
Total IQ                       .73           .63 




Apparently, the predictive ability of the WPPSI in Israel was found 
to be highly satisfactory. The reported correlations were higher than 
the ones found in American studies of the WPPSI, but resembled those 
found with some other mental tests for similar age groups (Kaufman 
and Kaufman, 1972). 

To some degree this high predictability of the WPPSI could be 
hypothesized as being attributable to the heterogeneity of the sample, 
as the school underwent recent integration with a lower-class area. 
Additional validity statistics on the Israeli culture and subcultures are 
needed to determine whether these preliminary findings can be 
replicated. 
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INTERRELATIONSHIPS AMONG PSYCHOLOGICAL 

MEASURES OF COGNITIVE STYLE AND FANTASY 

PREDISPOSITION IN A SAMPLE OF 100 CHILDREN 
IN THE FIFTH AND SIXTH GRADES 


SUSAN McNARY, WILLIAM B. MICHAEL, LEO RICHARDS, 
AND CONSTANCE LOVELL 


University of Southern California 


For a sample of 100 fifth and sixth grade pupils of middle-class 
background (51 girls and 49 boys), intercorrelations among measures 
of three cognitive style constructs of reflection-impulsivity, field 
dependence-independence, and internal-external locus of control and 
three measures of fantasy predisposition were for the most part low 
and statistically not significant. Exceptions were noted in the instance 
of the sample of girls for whom coefficients of .36, .38, and .51 between 
the measure of locus of control and each of three measures of 
fantasy predisposition were statistically reliable beyond the .01 level. 
The hypothesis of a positive relationship between a measurable con- 
struct of fantasy predisposition and each of three measurable con- 
structs of cognitive style received only limited support. Furthermore, 
it did not appear that fantasy was a part of a larger or more general 
construct of cognitive style or that a universal construct of cognitive 
style existed. 


DURING the past several years considerable attention has been given 
to the study of fantasy, daydreaming, and imagination in children as 
being potentially facilitating to their cognitive and emotional development 
(e.g., Singer, 1973). Recently, Singer (1973, p. 220) has stressed 
the need to obtain evidence regarding whether a tendency toward 
make-believe play might not be but one component of a more general 
cognitive style such as reflection-impulsivity (Kagan, Rosman, Day, 
Albert, and Phillips, 1964) or field dependence-independence (Witkin, 
Oltman, Raskin, and Karp, 1971). In fact, another extensively investigated 
construct of locus of internal-external control, which was 
derived from social learning theory (Rotter, 1954) and measured by 
Bialer (1961), would also appear, at least on the surface, to [...] 
many of the same activities suggested by the construct of field 
independence-dependence as well as an inclination to [...] 
autonomous or self-directed behavior patterns such as those involved 
in fantasy. 


Problem 


[...] group of 100 fifth- and sixth-grade children of middle-class background 
the extent of the relationship, if any, between each of three measures of 
cognitive style (reflection-impulsivity, field dependence-independence, 
and internal-external locus of control) and each of two separate 
measures and a combined measure of fantasy predisposition. On the 
assumption that in their operational representation, reflection, field 
independence, and internal locus of control constitute positive poles 
whereas impulsivity, field dependence, and external locus of control 
portray negative poles and on the further assumption that high 
standing on a measure of fantasy reflects its presence, the following 
two hypotheses within the context of the theoretical and empirical 
contributions of the previously cited researchers were suggested: 


1. A positive relationship would exist between a measurable construct 
of fantasy predisposition and each of three measurable constructs 
of cognitive style. 


2. Positive interrelationships would exist between pairs of measurable 
constructs of cognitive style. 


High positive interrelationships among all four constructs [...] 
might suggest that each was a subcomponent of a relatively general [...] 
ubiquitous construct of cognitive style. A study of the pattern of 
interrelations among measures would also furnish some information 
regarding the extent to which measures of the selected constructs were 
independent of or related to one another as well as predictive of one 
another. Some inferences about construct validity might also be 
possible. Since many investigators (Singer, 1973; Pulaski, 1970; and 
Torrance, 1962) have considered imagination and fantasy-related 
activities as cognitive skills that reflect many of the same abilities [...] 
flexibility, ideational fluency, and originality found in creative 
endeavors, the demonstration of significant relationships between 
measures of the cognitive styles cited and those of fantasy predisposition 
might suggest important implications for nurturing creative 
problem-solving skills in children. 
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Sample 


Residing in a middle-class suburban community near Los Angeles, 
the 100 fifth- and sixth-grade children (51 girls and 49 boys) varied in 
age from 10 years, 7 months to 13 years, 2 months. 


Instrumentation 


To represent the constructs of reflection-impulsivity, field 
dependence-independence, and internal-external locus of control, 
respectively, the measures or scales entitled Matching Familiar 
Figures (MFF) yielding separate time scores and error scores (Kagan, 
et al., 1964), Embedded Figures Test (EFT) furnishing a time score 
only (Witkin, et al., 1971), and Children's Locus of Control Scale 
(CLOCS) providing a score of the number of correctly answered items, 
were employed. On the MFF and EFT high time scores were con- 
sidered to indicate, respectively, reflection and field dependence, 
whereas low time scores suggested, respectively, impulsivity and field 
independence. On the MFF a high error score was interpreted as 
revealing impulsivity, and a low error score, reflection. 

The construct of fantasy predisposition was defined operationally as 
the (a) M score derived from the Holtzman Inkblot Technique (HIT) 
(Holtzman, 1963), (b) the total score from Forms A and B of the Torrance 
Tests of Creative Thinking (TTCT), Thinking Creatively with 
Words, Activity 7: Just Suppose (Torrance, 1966, 1974), and (c) an unweighted 
sum of scores (a composite score) on these two measures. 
Although it could be argued rather convincingly that a composite 
score might reveal a complex construct reflecting quite varied psy- 
chological processes, the correlational analysis of each individual fan- 
tasy measure which contributed to the composite score as well as of 
the composite score with each of the other four measures of cognitive 
style would prevent any important information loss. 

In addition to the seven psychological variables just enumerated, a 
measure of chronological age was also included. All seven test 
variables, their intended constructs, and the age variable are cited in 
Table 1. 


Test Administration 


The first cited author and a trained assistant administered the five 
tests employed in a specific order during a 4-day period at each of four 
schools. Working in separate rooms, the examiners gave individual 
administrations of the MFF, EFT, and the TTCT subtest. The CLOCS 
and HIT were administered to groups of approximately 10 children 
approximately one or two days after the examinees had completed the 
first three tests. 


Data Analysis 


Intercorrelations of scores on the seven test variables and 
chronological age were effected for the total sample as well as for 
separate subsamples of 49 boys and 51 girls. For each of these three 
groups, ranges, means, and standard deviations of scores on each 
variable were calculated and are reported in Table 1. In view of the 
directional nature of the research hypotheses one-tailed significance 
tests were employed. 


Findings 


In terms of the correlational entries in Table 1, the following results 
may be summarized. 

1. Although for the total group the correlation coefficients between 
scores on scales of reflection-impulsivity and of field dependence- 
independence and scores on each of the three fantasy measures were 
small (varying from —.09 to .08) and statistically not significant, 
statistically reliable but low positive correlation coefficients (.21, .28, 
and .29) were found between a measure of internal locus of control 
and each of three measures of fantasy predisposition. 

2. For the subgroup of boys, low correlation coefficients ranging 
from -.26 to .23 (only one being significant) between measures of 
fantasy predisposition and those of cognitive style were obtained. 

3. Although for the subgroup of girls coefficients of —.28 and —.24 
between a measure of field dependence and each of two measures of 
fantasy predisposition were statistically significant at the .05 level, 
coefficients of .36, .38, and .51 between the measure of locus of internal 
control and each of the three measures of fantasy predisposition were 
statistically reliable beyond the .01 level. 

4. For all samples, the values of the intercorrelations among the 
measures of cognitive style fell between —.25 and .35 with only six 
coefficients attaining statistical significance. 

5. With respect to the HIT measure (Variable 6) and the Total Fantasy 
Score (Variable 7) the mean of the subsample of girls was significantly 
higher than that of the boys (p < .01). 


6. In the instance of the subsample of girls, statistically significant 
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correlation coefficients of —.29 and .25, respectively, occurred between 
age and (time) scores on the measure of field dependence and between 
age and scores on the HIT measure of fantasy predisposition. 
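
The significance levels cited for the subsample of girls can be checked with the usual t test for a correlation coefficient. The sketch below assumes n = 51, one-tailed tests as stated under Data Analysis, and critical t values for 49 degrees of freedom copied from a standard t table; the function names are hypothetical, and taking the absolute value assumes that each coefficient fell in its hypothesized direction.

```python
import math

N_GIRLS = 51

# Assumed one-tailed critical t values for 49 df, from a standard t table.
CRIT_05 = 1.677
CRIT_01 = 2.405


def t_for_r(r, n):
    """t statistic for testing r against zero, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def one_tailed_level(r, n=N_GIRLS):
    """Return the strictest tabled level (.01, .05) that |t| reaches, else 'n.s.'."""
    t = abs(t_for_r(r, n))
    if t >= CRIT_01:
        return ".01"
    if t >= CRIT_05:
        return ".05"
    return "n.s."


# Coefficients reported for girls in Finding 3.
reported = [-0.28, -0.24, 0.36, 0.38, 0.51]
levels = {r: one_tailed_level(r) for r in reported}
for r, level in levels.items():
    print(f"r = {r:+.2f}  t = {t_for_r(r, N_GIRLS):+.2f}  level: {level}")
```

Under these assumptions the two field dependence coefficients reach the .05 level and the three locus of control coefficients reach the .01 level, which agrees with the levels stated in the text.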


Conclusions 


On the basis of the data obtained, the following conclusions were 
drawn: 


1. In the main, only limited support was obtained for the first 
hypothesis and practically no support for the second hypothesis. 


2. In light of the modest correlations obtained for girls between 
locus of internal control and fantasy predisposition it was suggested 
that girls entering or about to enter adolescence who show somewhat 
greater internal control might be expected to exhibit a higher level of 
predisposition to fantasy. 


3. Fantasy would not seem to be part of a larger or more general 
construct of cognitive style. 


4. The presence of a universal or general construct of cognitive style 
would appear to be highly improbable. 


Discussion 


That more striking evidence for the support of the two major 
hypotheses was not found could be attributed to a number of factors, 
particularly the unreliability of the measures employed. Time con- 
straints imposed by the school program prevented the realization of 
test-retest estimates of reliability. Furthermore, the lack of 
homogeneity of items within many of the scales not only precluded the 
determination of interpretable estimates of internal consistency but 
also prevented the realization of estimates of potential maximum cor- 
relations that could have been attained between two sets of scores 
theoretically free of error variance. 
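
The "potential maximum correlations" mentioned above are conventionally estimated with the classical correction for attenuation. The formula is given here as a reminder only; the symbols are notation assumed for this note rather than taken from the article.

```latex
\[
\hat{r}_{T_x T_y} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\]
```

Here $r_{xy}$ is the observed correlation between two measures and $r_{xx}$ and $r_{yy}$ are their reliability estimates. Without reliability estimates for the scales, the denominator is unknown and the corrected value cannot be computed, which is the limitation the authors describe.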

One disconcerting finding was the pattern of positive and negative 
signs accompanying the three coefficients arising from the intercorrela- 
tions of the first three measures or variables cited in Table 1. Although 
the negative coefficients of —.22, —.21, and —.21 for the total sample, 
the subsample of boys, and the subsample of girls, respectively, 
between time scores of the MFF (Variable 1) and error scores of the 
MFF (Variable 2) were anticipated, as were the corresponding 
positive coefficients of .30, .35, and .22 between Variable 2 (MFF 
Error Score) and Variable 3 (EFT Time Score), the positive coefficients 
of .19, .14, and .31 between Variable 1 (MFF Time Score) and 
Variable 3 (EFT Time Score) were expected to be negative. In other 
words, as expected, the correlations between measures of reflection 
(Variable 1) and impulsivity (Variable 2) were negative and those 
between impulsivity (Variable 2) and field dependence (Variable 3) 
were positive. Not anticipated, however, was the positive correlation 
between the measure of reflection (Variable 1) and that of field 
dependence (Variable 3), although admittedly only the positive correla- 
tion of .31 for the subsample of girls was statistically reliable. The ex- 
aminers did note during the testing sessions a tendency on the part of a 
large proportion of children who displayed behavior interpreted as be- 
ing field dependent and impulsive to perseverate in responses revealing 
errors or mistakes in performance. Hence, the extension in time as- 
sociated with perseveration that took place in the completion of the 
MFF and EFT tasks (both scored for the amount of time required) 
might have accounted in part for the small positive correlation 
between reflection (Variable 1) and field dependence (Variable 3). 
Moreover, there was the distinct possibility that the unfamiliarity of 
the tasks and the lack of sufficient prior time spent on practice ques- 
tions or in clarification of task requirements could have contributed a 
disproportionate increment to the time score—time that had to be 
spent in comprehending the nature of the task and in commission of 
several errors before the child could eventually figure out what he or 
she had to do. Thus in addition to a perseverative error, the additional 
time required of a relatively immature group to comprehend the 
nature of task requirements would suggest that the correlation 
reflected a time component artifact that in reality was unrelated to the 
intended perceptual and affective characteristics underlying the 
reflection-impulsivity and field dependence-independence constructs 
as conceptualized by theorists who defined them. 

The existence of sex differences in the manifestation of fantasy could 
be accounted for in terms of a number of speculative factors such as 
the nature of the particular measures chosen, level of cognitive and 
perceptual maturation of the subsamples of boys and girls, richness of 
varied stimuli in the home environment permitting a basis for im- 
aginative thinking (as related in turn to the socioeconomic status of 
the families), and the social expectations of significant others in the 
adult or peer culture regarding the desirability or acceptability of dis- 
playing fantasy-related behaviors. Although Biblow (1970) and 
Pulaski (1970) failed to find significant sex differences in their respective 
investigations of fifth-grade and kindergarten children, it is conceivable 
that differences observed in the two subsamples studied were 
related to a combination of several of the factors just cited. 
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Recommendations 


In view of the somewhat inconclusive results of this investigation, it 
is suggested that a need exists to examine within a comprehensive 
theoretical framework a number of constructs pertaining to cognitive 
styles, predisposition to fantasy, perseveration, creative abilities, and 
problem-solving skills as well as to develop appropriate instruments 
for their operational description. In combination with a sound and inventive 
conceptualization of affective and cognitive constructs, use of 
the paradigm afforded by the multitrait-multimethod matrix devised 
by Campbell and Fiske (1959) would afford a useful methodology for 
clarification of the nature of the constructs studied. 

Initiation of longitudinal studies would provide evidence concerning 
the influence of age, sex, socioeconomic background, and subcultures 
on the manifestation of fantasy and identifiable cognitive styles. It 
could be interpreted that applications of the Campbell and Fiske 
methodology on a longitudinal basis would do much to improve not 
only the clarity and meaningfulness of the constructs studied but also 
the validity and reliability of their empirical representation by a 
variety of tests and scales. 
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A COMPARISON OF THE WRAT AND THE PIAT 
WITH LEARNING DISABILITY CHILDREN 


DALE D. BAUM 
New Mexico State University 


This study was concerned with the comparative performance of 
learning disability children on the WRAT and the PIAT. Correla- 
tions between corresponding subtests of the two instruments were 
quite high for Reading and moderately high for Spelling and 
Arithmetic across four age levels (7 to 8, 9, 10, and 11 year olds). It 
was concluded that the utility of the PIAT is as promising as that of 
the WRAT, although diagnosticians using the PIAT may wish to exclude 
the Reading Comprehension subtest, as this subtest tends to be 
measuring the same skills as Reading Recognition at the lower grade 
levels. 


PSYCHOLOGISTS and educational diagnosticians responsible for the 
assessment of children with learning problems generally include a test 
of school achievement in their battery of instruments if current information 
is not readily available from teachers or from the school files. 
Typically, the instrument of choice has been an individually administered 
test of the screening or wide-range variety which would 
yield an overview of an individual's academic status. 

The Wide Range Achievement Test (WRAT) in its various editions 
(Jastak, 1936, 1946; Jastak and Jastak, 1965) has enjoyed wide popularity 
as a relatively quick test of the basic school subjects of 
Reading, Spelling, and Arithmetic. Until the recent introduction of the 
Peabody Individual Achievement Test (PIAT) by Dunn and 
Markwardt (1970) the WRAT had maintained a virtually unchallenged 
position in educational diagnosis. Proger (1970) has suggested 
that the PIAT represents a sophisticated and formidable challenge to 
the WRAT. 

Copyright © 1975 by Frederic Kuder 

In addition to the subtests Reading Recognition, Spelling, and 
Mathematics, the PIAT also includes subtests of Reading Comprehension 
and General Information which have no counterpart on the 
WRAT. The PIAT also provides normative data for a total test score 
while the WRAT does not. The PIAT employs multiple choice selec- 
tion for Reading Comprehension, Spelling, and Mathematics, while 
the WRAT measures Reading by word recognition, Spelling from dic- 
tation, and Arithmetic from computation. The WRAT reportedly re- 
quires 20 to 30 minutes to administer, whereas the PIAT requires 30 to 
40 minutes. 

In previous research by Sittington (1970) with educable mentally 
retarded adolescents and by Soethe (1972) employing normal, reading 
disabled, and mentally retarded children, the conclusion was reached 
that those subtests of the PIAT having counterparts on the WRAT 
demonstrated reasonably high concurrent validity. The present investigation 
was concerned with the comparative performance of learning 
disabled children of various ages on the WRAT and the PIAT. 


Method 


The 82 males and 18 females who served as subjects for this study 
were randomly selected from self-contained classes for learning disabled 
students. The classes were located in primarily middle and 
upper-middle class suburban areas adjacent to a large southern city. 
Twenty-five subjects were selected from each of the four age groups: 
7 to 8, 9, 10, and 11 year olds. The average age for the total group was 
9 years, 7 months with a range from 7 years, 2 months to 11 years, [...] 
months. 

Both the WRAT and the PIAT were individually administered to 
each subject in a single session. The order for presenting the two tests 
was reversed for every other subject in order to counterbalance 
practice effects which might have accrued. 


Results 


The means and standard deviations of grade equivalency scores for 
each of the subtests of the WRAT and the PIAT are presented in Table 
1 for subjects at each of the four age levels studied. In general, the subtest 
scores of the WRAT did correspond more closely with their 
counterparts on the PIAT at the 7 to 8 year level than they did at the 
11 year level. This finding, however, tends merely to reflect [...] 
characteristics of both tests, i.e., as age increases the standard error of 
measurement increases on both instruments. 

Pearson product moment correlation coefficients between corresponding 
subtests of the WRAT and the PIAT are presented in 
Table 2. The WRAT Reading subtest was highly correlated (.85 to .89) 
with the PIAT Reading Recognition subtest across all four age levels. 
The PIAT Reading Comprehension subtest correlated quite highly 
(.90) with WRAT Reading at the 11 year level, moderately (.72 and 
.62) at the 7 to 8 and 9 year levels, and relatively low (.56) at the 10 
year level. Correlations between the PIAT and WRAT Spelling sub- 
tests were moderate across all four age levels (.61 to .71), whereas the 
correlations between PIAT Mathematics and WRAT Arithmetic 
ranged from moderate (.77) at the 7 to 8 year level to quite low (.49) at 
the 11 year level. 

The intercorrelations of the subtests of the two instruments are 
presented in Table 3. Employing the entire sample of 100 subjects, 
WRAT Arithmetic correlated .77 with PIAT Mathematics, whereas 
WRAT Spelling correlated .81 with PIAT Reading Recognition and 
.77 with PIAT Reading Comprehension. Of additional interest is the 
finding that the subtests of the WRAT did correlate as highly with the 
PIAT General Information and Total Test as did the remaining subtests of the 
PIAT. It should be noted, of course, that the heterogeneity in age of 
the total sample contributed to the realization of what must be viewed 
as relatively high coefficients. 


Discussion 


In this study PIAT Reading Comprehension and Reading Recogni- 
tion correlated .78 as compared with .84 in both the Sittington (1970) 
and Soethe (1972) studies. Since the first 18 items of the Reading 
Recognition subtest count as the first 18 items of the Reading 
Comprehension subtest, relatively high correlations between these 


TABLE 2 
Correlations between Selected Subtests of the PIAT and WRAT by Age Level 


Age Level                      7 to 8     9       10      11 
N                                25       25      25      25 

PIAT subtest with WRAT Reading 
  Reading Recognition           .85**   .87**   .89*    .87** 
  Reading Comprehension         .72**   .62**   .56*    .90** 

PIAT subtest with WRAT Spelling 
  Spelling                      .62**   .71**   .62**   .61** 

PIAT subtest with WRAT Arithmetic 
  Mathematics                   .77**   .79**   .65**   .49* 


* Significant at .05 level. 
** Significant at .01 level. 
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subtests would be expected with subjects who function at the lower 
grade levels because of the overlapping content. Since the content 
validity of the Reading Comprehension subtest at the lower grade 
levels is questionable because of its sharing of the same content with 
Reading Recognition, this subtest should be interpreted cautiously by 
diagnosticians who employ the instrument with subjects functioning at 
the lower grade levels. 

It was also observed that PIAT Reading Comprehension correlated 
.77 with both the WRAT Spelling and Reading subtests even though 
the first subtest is allegedly measuring comprehension; the second, 
spelling by dictation; and the third, word recognition. Similarly, PIAT 
Reading Recognition correlated .81 with WRAT Spelling and .92 with 
WRAT Reading. These relatively high intercorrelations suggested that 
a set of prerequisite skills peculiar to the language arts area operates 
fairly evenly across these subtests at the lower grade levels. Therefore, 
the language arts oriented subtests both within and between the PIAT 
and the WRAT share a commonality of both general educational con- 
tent and psychological process which tends to lessen the specific con- 
tent validity of these individual subtests at the lower grade levels. 


Conclusion 


In view of the high degree of correlation between the two instru- 
ments, the findings of this study suggested that the utility of the PIAT 
as a wide-range achievement test is at least as promising as the WRAT 
has been. However, there appears to be little advantage in administer- 
ing both PIAT Reading Comprehension and Reading Recognition to 
subjects functioning at the lower grade levels, since both subtests tend 
to be measuring the same skills. 

When either of the two instruments is considered as a criterion, a 
substantial degree of concurrent validity relative to school achieve- 
ment would appear to exist between like subtests of the two instru- 
ments. In view of the more comprehensive and diagnostic data possi- 
ble with an additional two subtests and a total test score, it would ap- 
pear that the PIAT can be employed confidently in the assessment of 
children with learning problems. 
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VALIDITY AND RELIABILITY OF A SIMPLE DEVICE 
FOR READINESS SCREENING 


MARJORIE HAYES¹ 
Crestwood Kindergarten 


EMANUEL MASON 
University of Kentucky 


ROBERT COVERT 
University of Virginia 


The utility of the Hayes Early Identification Listening Response 
Test (HEILRT), a rapidly administered screening test for readiness 
for first grade, was studied with 121 kindergarten pupils who were 
tested at the beginning of the academic year. The test could be ad- 
ministered in 15 to 20 minutes to groups of up to 30 children at a 
time. Reliability of the test was estimated to be .86 and rater 
reliability was .99. The test correlated highly positively (.79) with the 
Metropolitan Readiness Test, but not with age, number of siblings, 
or educational level of the mother. It was concluded that based upon 
the data, the HEILRT held promise as a readiness screening instrument 
for use with kindergarten children. 


¹ Copies of the materials developed for the Hayes Early Identification Listening Test 
can be obtained by writing to Mrs. Marjorie Hayes, Director, Crestwood Kindergarten, 
006 East Main Street, Frankfort, Kentucky 40601. 


DECIDING the readiness of young children to begin work in school 
has long been an issue concerning educators. Standardized tests 
developed to assess the various facets of school readiness have tended 
to be quite long and arduous to administer to young children who are 
not used to testing and school routine (e.g., the Metropolitan 
Readiness Test and the Stanford Early School Achievement Test). 
Readiness testing could be more efficient if a reliable and valid initial 
screening instrument were available which broadly assessed readiness 
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skills, and which took little time to administer and score. The purpose 
of the present study was to investigate the characteristics and utility of 
such a test, the Hayes Early Identification Listening Response Test 
(HEILRT). The HEILRT was designed to emphasize the importance 
of listening comprehension, visual perceptual, and fine motor skills. 
The test was developed over several years to screen young children for 
readiness rapidly and accurately by using tasks of the type with which 
children beginning school were usually familiar. 


Method 


The HEILRT was administered to 121 beginning kindergarten 
pupils in a private school in central Kentucky. The mean age of the 
group at the time of the test administration was 62 months. The test 
was administered by the staff of the school to groups of 15 to 30 
children at a time during the first month of school. 

The HEILRT contains a series of psychomotor tasks for which the 
teacher gives verbal instructions. Each child is initially given a piece of 
blank paper and a box of coloring crayons. Standardized instructions 
require that the teacher show several examples at the blackboard. 
Then tasks are given to the children to perform at their tables, such as
"Draw a line standing up," "Draw a line lying down," "Draw what you
think is number four," and "Draw a circle. Color the circle red." The
present experimental version of the test contains ten tasks. Each item 
is scored for a number of specified credits totaling 22 points. 


Results 


For the 121 pupils studied in this report, the mean score was 15.65
with a standard deviation of 4.71; the median was 16.75; and the
modal score was 17. The reliability estimate based on use of the
Kuder-Richardson formula 20 was found to be .86, and the standard
error of measurement was 1.73. Each of four untrained raters was
asked to score a random sample of 10 protocols. From these ratings,
interrater reliability was estimated to be .99 through using the analysis
of variance model (Kerlinger, 1971). Thus, the HEILRT test score was
reliable in terms of the scorers' performance as well as the children's.
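
The two reliability statistics named above can be illustrated with a minimal sketch. It assumes the textbook forms of Kuder-Richardson formula 20 and of the standard error of measurement, and it assumes each scored credit is treated as a pass/fail item; the report does not say how the 22 credits were handled, and the function names and the toy data are hypothetical. With the rounded values reported above (SD = 4.71, reliability = .86) the usual formula gives about 1.76, close to the reported 1.73; the small gap may reflect rounding or a different variance estimate, which the report does not specify.

```python
import math


def kr20(item_scores):
    """Kuder-Richardson formula 20 for pass/fail (0/1) items.

    item_scores: one row per examinee, one column per item.
    Uses the population variance of the total scores.
    """
    n_people = len(item_scores)
    n_items = len(item_scores[0])
    totals = [sum(row) for row in item_scores]
    mean_total = sum(totals) / n_people
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_people
    # Sum of p * q over items, where p is the proportion passing the item.
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in item_scores) / n_people
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


def standard_error_of_measurement(sd, reliability):
    """Conventional SEM: SD times the square root of (1 - reliability)."""
    return sd * math.sqrt(1 - reliability)


# Hypothetical toy data: four examinees, three pass/fail items.
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(toy), 2))

# Values reported for the HEILRT sample.
print(round(standard_error_of_measurement(4.71, 0.86), 2))
```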

Validity was studied by considering the relationship of HEILRT 
scores to scores on the Metropolitan Readiness Test (MRT) (Hildreth, 
Griffiths, and McGauvran, 1965) given later in the kindergarten year 
and other variables. Table 1 shows the intercorrelations of these 
variables with the HEILRT. Since HEILRT correlated highly
positively (.79) with MRT Total Score, it was a valid predictor of
MRT total score. In addition, the test correlated positively with MRT
subtest scores. It is important to note that neither MRT nor HEILRT
correlated very highly with age, number of siblings, or educational
level of the mother and father. Thus, the pattern of correlations would
appear to lend some support to the construct validity of the test,
given its independence from these variables and its high
correlation with the MRT total score. However, the correlation of .29
between sex and HEILRT scores suggests that these scores were higher
for girls than for boys. This outcome corresponds to the generally
accepted notion that at younger ages, girls have higher readiness skills
than do boys.


Discussion and Conclusions 


Although the HEILRT is still in experimental form, results suggest 
the test to be valid and reliable for readiness screening of young 
children. The ease of administration and scoring by untrained class- 
room teachers is an important advantage of the test. In addition, the 
test can be administered to groups of up to 30 children in only 15 to 20 
minutes. It is thought that the test can be used most effectively as part 
of a schoolwide testing program for readiness when administered by 
the Guidance Department or reading teachers. It can be economically 
and easily given to all pre-school or kindergarten children in a school 
population. Those children who obtain low scores can be identified 
and examined further for specific readiness problems.

Another feature of the test is its obvious economy. The test takes lit- 
tle time to administer and to score. It does not require lengthy training 
for the examiner or scorer. Furthermore, the materials necessary are 
only a set of instructions, a set of crayons, and a piece of blank paper. 
Research on the test is continuing. 


REFERENCES
Hildreth, G. H., Griffiths, M. L., and McGauvran, M. E. Metropolitan
Readiness Tests. New York: Harcourt, Brace and World, 1965.
Kerlinger, F. Foundations of behavioral research (2nd ed.). New York:
Holt, Rinehart and Winston, 1973.




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 499-501. 


INTERCORRELATIONS AMONG MEASURES OF
INTELLIGENCE, ACHIEVEMENT, SELF-ESTEEM, AND
ANXIETY IN TWO GROUPS OF ELEMENTARY SCHOOL
PUPILS EXPOSED TO TWO DIFFERENT MODELS
OF INSTRUCTION¹


JOHN LEWIS 
Winona State College 


RICHARD ADANK 
Winona Public Schools 


The intercorrelations among intelligence, achievement,
self-esteem, and anxiety measures were studied among fourth, fifth, and
sixth grade pupils in self-contained and individualized programs. In
addition to significant positive interrelationships among the
measures of intelligence, achievement, and self-esteem for each of the
groups exposed to different models of instruction, the lack of a
significant negative correlation between the measure of anxiety and
either the achievement or intelligence measure for the group exposed
to individualized instruction, as compared with the presence of a
corresponding significant negative correlation for the group in the
traditional self-contained model, was judged noteworthy.


¹ The research efforts underlying this study were supported by Title III funds made
available to the Winona Public School System, Winona, Minnesota, Project No. 33-71-4049
by the United States Department of Health, Education, and Welfare.


THIS study was undertaken to determine the intercorrelations
among measures of intelligence, achievement, self-esteem, and anxiety
among elementary pupils who were enrolled in two different styles of
elementary instruction. The two schools were a traditional
self-contained structure and an open school using the Westinghouse PLAN
System of computer-backed individualized instruction.


Procedures


The subjects were pupils in grades four, five, and six in both schools
for whom complete data could be found. Intelligence was defined as
IQ scores from the SRA Tests of General Ability. Achievement was
defined as the composite of raw scores from the Stanford Achievement
Test, and self-esteem was measured by the Self-Esteem Inventory
(Coopersmith, 1967). Anxiety was defined as the combined raw scores
from the General Anxiety Scale for Children (Sarason, 1960) and the
Test Anxiety Scale for Children (Sarason, 1960). These anxiety scales
were constructed so that positive responses and hence higher scores
indicate higher anxiety. The correlations were calculated among the
variables at each of the three grade levels, then converted into Fisher's
z scores, weighted by the N's at each grade, summed, and then
reconverted into r values.
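
The pooling procedure described above can be sketched as follows. This is a minimal illustration, assuming the weighted sum of Fisher z values is divided by the total N before being converted back (the text says only "weighted ... summed"); some treatments weight by N - 3 instead. The function name and the grade-level values in the example are hypothetical and are not taken from the study.

```python
import math


def pooled_correlation(correlations, sizes):
    """Combine correlations from several groups into one r.

    Each r is converted to Fisher's z (the inverse hyperbolic tangent),
    the z values are averaged with the group sizes as weights, and the
    mean z is converted back to r with the hyperbolic tangent.
    """
    z_values = [math.atanh(r) for r in correlations]
    weighted_sum = sum(n * z for n, z in zip(sizes, z_values))
    mean_z = weighted_sum / sum(sizes)
    return math.tanh(mean_z)


# Hypothetical correlations for grades four, five, and six.
grade_rs = [0.62, 0.70, 0.71]
grade_ns = [30, 29, 30]
print(round(pooled_correlation(grade_rs, grade_ns), 2))
```

Averaging in the z metric rather than averaging the r values directly is the usual reason for this transformation, since the sampling distribution of r is skewed when the correlation is far from zero.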


Results 


The intercorrelations along with indications of significance for 
pupils in each of the two schools are shown in Table 1. 

Within each group of pupils exposed to a different mode of
instruction, significant positive intercorrelations were found among
intelligence, achievement, and self-esteem measures and significant
negative correlations were found between scores in the self-esteem and
anxiety tests. Significant negative correlations appeared between
scores in the anxiety measure and those in achievement and
intelligence tests for pupils in the self-contained school, but non-significant
negative correlations were found between scores in the anxiety
measure and scores in the intelligence and achievement tests for pupils
in the individualized school.


TABLE 1
Correlations among Intelligence, Achievement, Self-Esteem, and Anxiety Measures
among Elementary Pupils in Two Models of Measured Instruction


Individualized Instructors Plan
(N = 89)

Test Measure                                     1       2       3       4
1. SRA Tests of General Ability (IQ Scores)      —     .68**   .24*    -.16
2. Stanford Achievement Test (Composite)       .68**     —     .30**   -.16
3. Self-Esteem Inventory                       .24*    .30**     —     -.23*
4. Combined Anxiety Score                      -.16    -.16    -.23*     —

Self-Contained Structure
(N = 130)

Test Measure                                     1       2       3       4
1. SRA Tests of General Ability (IQ Scores)      —     .59**   .24**   -.35**
2. Stanford Achievement Test (Composite)       .59**     —     .42**   -.38**
3. Self-Esteem Inventory                       .24**   .42**     —     -.41**
4. Combined Anxiety Score                      -.35**  -.38**  -.41**    —

* p < .05.
** p < .01.

These results suggest that although the interrelationships among
self-esteem, intelligence, and achievement measures did not vary
substantially among pupils enrolled in the two models of instruction, the
use of the individualized model might reduce the extent of the negative
relationship between anxiety and either achievement or aptitude.
THE RELATIONSHIP BETWEEN ATTITUDES TOWARD
SCHOOL AND ACHIEVEMENT FOR GROUPS OF
ELEMENTARY SCHOOL CHILDREN EXPOSED
TO TWO MODELS OF INSTRUCTION¹


JOHN LEWIS 
Winona State College 


RICHARD ADANK 
Winona Public Schools 


A study of the relationship between attitudes toward school and 
learning among pupils in grades one through six in two different 
models of instruction revealed four significant correlations among 
twelve calculated. The implication was that attitudes towards school 
were not systematically related to learning among elementary pupils. 


¹ The research efforts underlying this study were supported by Title III funds made
available to the Winona Public School System, Winona, Minnesota, Project No. 33-71-4049
by the United States Department of Health, Education, and Welfare.


THIS study was undertaken to determine whether measured attitude
towards school was related to an objective measure of learning within
each of two groups of pupils exposed to different models of
instruction. The first group attended an elementary school which afforded an
open-styled model of instruction involving the Westinghouse PLAN
computer-backed individualized program. The second group attended
a traditional self-contained elementary school.


Procedures


The subjects were 286 pupils in grades one through six in the
individualized school and 335 pupils in grades one through six in the
self-contained school. Both schools are located in Winona, Minnesota.
Pupil attitude toward school was measured by a pictorial attitude
scale (Lewis, 1974) and pupil learning was measured by the Lee-Clarke
Reading Test for pupils in grades one and two and by the Stanford 
Achievement Tests for pupils in grades three through six. The attitude 
measure was administered at the beginning of the school year. The 
achievement measures were given at the end of the school year. 


Results 


The correlations between measured attitude and achievement at 
each grade level in each of the two schools emphasizing different 
models of instruction are presented in Table 1. It should be noted that 
the attitude scale provided for three possible responses: “like,” 
“neutral,” and "dislike." Very few pupils responded in the dislike 
category to any of the items. Thus, the scores on the attitude scale es- 
sentially reflected feelings of either liking or being indifferent toward 
school. 

Significant relationships between attitude and achievement ap- 
peared at two of the six grade levels at each of the two elementary 
schools for a total of four significant correlations among the twelve 
groups. Non-significant relationships appeared among the other eight 
groups of pupils. Thus, responses of generally liking or being neutral 
toward the items on this scale taken at the beginning of the school year 
were not systematically related to pupil achievement in school 
measured at the end of the year. 


Discussion 


The appearance of some significant relationships, however, did raise
the question of why attitudes were related to achievement among certain
groups of pupils but not among others. Since two of these significant
correlations appeared among pupils in grades one and five in the
school with an individualized program and among pupils in grades
three and four in the school with a self-contained structure, it appeared
that factors contributing to the question of a relationship between
the two variables would not seem to include either grade level or model
of instruction. It was previously pointed out that subjects
responded in terms of a neutral or positive attitude toward school.
Thus, these findings should not be generalized to situations in which a
large number of pupils exhibit negative feelings towards school.


TABLE 1
Correlations between Attitudes toward School and Learning
in Two Models of Elementary Instruction


Grade Level


Model/Grade 1 2 3 4 5 6 
Individualized .26* —.13 11 100 55 2 
Self-Contained —.06 n el 40% .30* 15 д 


* Significant at the .05 level.
** Significant at the .01 level.
CONCURRENT VALIDITY OF THE TORRANCE 
TESTS OF CREATIVE THINKING AND THE 
WELSH FIGURAL PREFERENCE TEST 


THOMAS M. GOOLSBY, JR. AND LOREN D. HELWIG 
University of Georgia 


For a sample of 79 fifth grade pupils from a semi-rural farming
area in the Southeastern United States, intercorrelations among
subtest scores and total scores of the Torrance Tests of Creative
Thinking (TTCT) and the Welsh Figural Preference Test (WFPT) revealed
little if any relationship either between subtests from the TTCT and
those from the WFPT or between the two total tests themselves. It
would appear that the two scales which were designed to measure
constructs related to creativity share little common variance.




THE Torrance Tests of Creative Thinking (TTCT) and the Welsh
Figural Preference Test (WFPT) are two tests purporting to measure
constructs related to creativity (Torrance, 1974; Welsh, 1959, 1972).
Whereas the TTCT scales appear to be more cognitively oriented in
approach, the WFPT seems to reflect affective characteristics.

The purpose of this study was to explore the patterns of
interrelationships among the part scores and total scores of the TTCT and
WFPT for a sample of fifth grade pupils.

Seventy-nine fifth grade pupils from a semi-rural farming area in the
Southeastern United States were administered the Figural Form B of
TTCT and the WFPT within two days.

The Figural Form B of the TTCT yields subscores of Fluency (Fl),
Flexibility (Fx), Originality (Or), and Elaboration (El). The WFPT
furnishes subscores on subtests named "origence" (Og) and
"intellectence" (In).

A zero-order intercorrelation matrix with means and standard
deviations was obtained using the total test and subtest scores for the
TTCT and WFPT.

Copyright © 1975 by Frederic Kuder
TABLE 1
Intercorrelations among Subtest and Total Test Scores with Means and
Standard Deviations for TTCT and WFPT (N = 79)


Fl Fx Or El In


TTCT (total test) 215
Fluency (Fl)
Flexibility (Fx) 76
Originality (Or) 14 17
Elaboration (El) 49 42 40

WFPT (total test) 11 -05 -02 02
"Intellectence" (In) -08 -12 305: 617
"Origence" (Og) 14 -02 -01 07 20


Note.—Decimals omitted for correlation coefficients. 


Table 1 shows the intercorrelations among subtest and total test 
scores with means and standard deviations for TTCT and WFPT. The 
correlation between total test scores for TTCT and WFPT was .02. 
Evidence of concurrent validity between TTCT and WFPT was not
demonstrated for fifth grade pupils. That this evidence was
disconfirming was not a surprising outcome, since the TTCT and WFPT were
constructed from different frameworks of reference and/or theory
bases.

Some correlative evidence was viewed as useful in identifying sources
of shared variance in scores among selected pairings of subtests or of
subtests with an entire test of which they were not parts.
THE CORRELATION OF PARTIAL AND TOTAL
SCORES OF THE SCHOLASTIC APTITUDE TEST OF
THE COLLEGE ENTRANCE EXAMINATION BOARD
WITH GRADES IN FRESHMAN CHEMISTRY


L. G. PEDERSEN 
University of North Carolina, Chapel Hill 


For a sample of 325 students enrolled in freshman chemistry, it
was found that performance on the Verbal and Mathematics
Sections of the Scholastic Aptitude Test of the College Entrance
Examination Board exhibited low predictive validity in relation to a
criterion of grades in freshman chemistry.


A major problem at many universities is how to effect curriculum
changes which will increase the chances of college success for
educationally deprived and underprepared students (Kotnik, 1974;
Meckstroth, 1974; Wartell, 1974). The Chemistry Department at the
University of North Carolina, Chapel Hill, responded to this need by
scheduling a preparatory course for freshman chemistry for Fall, 1974.
An immediate concern pertained to the criterion by which a student
would be placed in such a preparatory course. Would one use as a
predictor scores on the Scholastic Aptitude Test (SAT) of the College
Entrance Examination Board (CEEB), high school rank, a linear
combination of these two variables, or some other procedure?


Purpose 


The purpose of this study was to determine the degree of
relationship between level of achievement in freshman chemistry and
(a) scores on the Verbal (V) part of the CEEB, (b) scores on the
Mathematics (M) part of the CEEB, and (c) an unweighted composite
(T) of the two part scores. In addition, the worthiness of a model
involving use of cutoff scores was evaluated for its predictive
capability.


Method and Findings 


In the present study a data base was established in terms of the
grades in first semester freshman chemistry (4.0 scale; A = 4.0, B =
3.0, C = 2.0, D = 1.0, and F = 0.0) of 325 students enrolled during
Fall 1974 and the Mathematics (M), Verbal (V), and Combined (T)
SAT scores of the same students. The data base consisted of two large
sections taught by two different teachers. About 15% of the students had
previously been placed in an "honors" type section on the basis of
superior preparation, high school standing, and special examinations.
To construct a criterion for prediction, the average SAT scores for
students in each of the five grade categories (Table 1) were evaluated. For
all three SAT categories (V, M, and T) the average grade point was
seen to increase regularly as the scores in the three test categories
increased. If one wishes to use the SAT score averages to predict grades,
one must establish a cutoff for A's, B's, or other grades in the three
SAT categories. A logical way to decide between an A and a B would be
to choose as a cutoff the midpoint between the mean A and mean B
scores: those above this midpoint would receive A's, those below (but
above the C/B midpoint) would receive B's. These cutoffs were computed
for the three SAT categories and are given in Table 2.
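
The midpoint rule just described can be sketched in a few lines. This is only an illustration under stated assumptions: the mean scores are those read from Table 1 (the B-grade Mathematics mean is taken as 577.0, the value implied by the Verbal and Total means), the names `MEAN_SAT`, `cutoffs`, and `predict_grade` are hypothetical, and a score exactly equal to a cutoff is assigned the lower grade, a case the text does not address.

```python
GRADES = ["A", "B", "C", "D", "F"]

# Mean SAT score of the students receiving each grade (A, B, C, D, F),
# as read from Table 1.
MEAN_SAT = {
    "V": [549.1, 526.8, 501.1, 479.8, 467.7],
    "M": [622.1, 577.0, 559.5, 534.8, 514.8],
    "T": [1171.2, 1103.8, 1060.7, 1014.6, 982.5],
}


def cutoffs(category):
    """Midpoints between the mean scores of adjacent grade groups."""
    means = MEAN_SAT[category]
    return [(upper + lower) / 2 for upper, lower in zip(means, means[1:])]


def predict_grade(score, category):
    """Return the highest grade whose lower cutoff the score exceeds."""
    for grade, cut in zip(GRADES, cutoffs(category)):
        if score > cut:
            return grade
    return GRADES[-1]


for category in ("V", "M", "T"):
    print(category, [round(c, 2) for c in cutoffs(category)])

print(predict_grade(500, "V"))
```

The computed midpoints agree with the Table 2 cutoffs to within rounding, which suggests the table was built the same way, although the article does not give the arithmetic explicitly.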

TABLE 1
Mean and Standard Deviation (SD) of Verbal (V), Mathematical (M), and Total
(T) Scores of the Scholastic Aptitude Test of the College Entrance Examination
Board for Freshman Students in Chemistry Receiving Different Grades

                      V                 M                  T
Grade            Mean     SD       Mean     SD       Mean      SD
A               549.1    83.1     622.1    62.8     1171.2    128.5
B               526.8    70.7     577.0    59.2     1103.8    114.0
C               501.1    76.2     559.5    58.6     1060.7    110.8
D               479.8    60.6     534.8    60.9     1014.6     90.5
F               467.7    54.8     514.8    68.3      982.5    105.5
All Students    510.3             566.5             1076.8

Note.—The average grade point in Fall, 1974 for all 325 students in the data base was 2.29.


TABLE 2
Cut-off Grading Model Based on SAT Scores

                                       SAT Category
Decision Between:                    V         M         T
A and B                            537.9     599.6     1137.5
B and C                            513.9     568.3     1082.3
C and D                            490.5     547.2     1037.7
D and F                            473.7     524.8      998.6
Percentage of Accurate Grade
  Predictions*                      24.0      28.6       31.4
Percentage of Nearly Accurate
  Grade Predictions†                56.2      64.7       66.2

* Percentage of grades predicted correctly on basis of cut-off model. Thus only 31.4% of the 325 students had
their grades predicted correctly on basis of composite (T).

† Percentage of grades predicted correctly within one letter grade, i.e., F-D, D-C-B, C-B-A, B-A on basis of
cut-off model. Thus 66.2% of the 325 had their grades predicted within one letter grade.


Also given in Table 2 are the results for the predictability of the
model. The average cutoffs suggested in Table 2 were used to
recalculate the grades for the 325 students of the data base. The level
of predictability is seen to be almost useless for forecasting the actual
grade. The model works considerably better for predicting the grades
within one letter grade of the correct grade (last line of Table 2), but it
may not be very satisfying to a student or advisor to know that the
student is predicted to make a D, C, or B with 2/3 probability. In all cases
the combined T category was slightly superior to the M category, and
the M category was superior to the V category. The correlation
coefficient was calculated for each SAT category vs. grade distribution. The
coefficients for V, M, and T measures were, respectively, .32, .42, and
.43, all of which were significant beyond the .01 level. The magnitudes
of the correlation coefficients indicated that the level of correlation
was not very useful, just as the grade cut-off model indicated.
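
One conventional way to read the size of these coefficients, not used in the article itself, is the squared correlation, which gives the share of variance in grades linearly associated with the SAT score. The short sketch below applies it to the three reported values; the function name is hypothetical.

```python
def variance_explained(r):
    """Squared correlation: proportion of variance shared by two measures."""
    return r * r


# Correlations with chemistry grade reported for the three SAT categories.
reported = {"V": 0.32, "M": 0.42, "T": 0.43}

for category, r in reported.items():
    share = variance_explained(r)
    print(f"{category}: r = {r:.2f}, r squared = {share:.2f}")
```

On this reading, even the composite score accounts for less than one fifth of the variance in grades, which is consistent with the weak hit rates of the cut-off model.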


Conclusions 


The following conclusions were drawn:

1. The functional dependence of average grade obtained in
freshman chemistry vs. average SAT category was regular and monotonic,
i.e., the higher the average scores on the V, M, or T measures for a
group of freshmen, the higher would be the average grade of that
group in freshman chemistry. The standard deviations, however, were
large.

2. There was little freshman chemistry grade predictability from
standing in any of the SAT-grade categories.

3. Use of SAT scores for deciding the placement of an individual
student was not statistically justified.
Jere E. Brophy and Thomas L. Good. Teacher-Student Relationships:
Causes and Consequences. New York: Holt, Rinehart & Winston,
Inc., 1974. Pp. xvi + 400. $5.95 (paperback).


This book is a well-written review and discussion of research on
teacher expectations and attitudes and how they affect and are affected 
by various characteristics of students. The approximately 500 
references cited in the 11 chapters reveal something of the volume of 
research stemming from Rosenthal and Jacobson's Pygmalion in the 
Classroom. As is frequently the case in educational research, the 
results of these efforts are open to divers interpretations. It is obvious 
from reading the book that its authors are committed to the belief that 
teacher expectations do, under certain circumstances and with certain 
teachers and students, affect the behavior of the latter. 

Since, at least for this book, a listing of the table of contents reveals
the substance of the topics considered, here it is: 1. Individual
Differences in Teacher-Student Interaction Patterns; 2. Teacher
Expectations; 3. Studies of Experimentally Induced Expectations; 4.
Naturalistic Studies of Teacher Expectation Effects; 5. The Influences
of Teachers' Attitudes toward Students on Classroom Behavior; 6.
Teacher Interview and Questionnaire Studies; 7. The Influences of the
Sex of the Teacher and Student on Classroom Behavior; 8. Individual
Differences and Their Implications for Teachers and Students; 9.
Promoting Proactive Teaching; 10. Classroom Research: Some
Suggestions for the Future; 11. Implications for Teaching.

As concluded from the summaries in Chapters 3 and 4, more of the
experimental studies of expectations found negative results,
but the reverse is true of naturalistic studies. Although the authors are
aware of the comparative advantages and disadvantages of
experimentation and naturalistic (observation, correlation) studies, they show a
definite preference for naturalistic studies. In fact, citing problems of
adequate controls and ethics, they reiterate throughout the book their
feeling that no more experimental studies should be conducted to
demonstrate the reality of expectation effects. Another point which
they continually stress is that most teaching is "reactive" rather than
"proactive."

Since this book is really about affective influences in education,
attention is not restricted to teacher expectations. It is recognized that
expectations are not independent of attitudes, and that the more
general question is how teacher and student attitudes and behavior
interact to influence each other. In discussing the results of variables
related to this interaction, some interesting conclusions are drawn, for
example:

1. Teachers prefer conforming, orderly students to assertive, in- 
dependent students. 

2. Teachers form strong first impressions of students on the basis of 
their early contacts with them. 

3. Teachers are usually not very aware of their own behavior 
toward students or how it affects the latter. 

4. There is little or no evidence that male teachers favor male stu- 
dents or that female teachers favor female students in any general 
sense. 

5. Male teachers appear to be more achievement-oriented than 
female teachers, and hence more interested in putting across the 
material than in determining whether students understand it. 

6. The more similar a teacher and student are, the more likely the 
teacher is to like the student. 

The authors do not stop at reviewing and discussing the research
literature. In Chapter 9 they give a number of concrete prescriptions
for promoting "proactive teaching," that is, making teachers aware of
their behavior in the classroom and helping them change it.
Furthermore, Chapters 9 and 10 contain many interesting suggestions for
research on expectations and attitudes. Finally, the implications for
teaching listed in Chapter 11, especially those concerning the
differential treatment of students toward whom the teacher has low or high
expectations, are useful even if they admittedly go beyond the research
data.

Basically, what the authors suggest in the last few chapters is that
teaching should be appropriate to the needs and personality of the
learner. Of course, in a class of 30 students, and at the secondary
school level several classes of 30 students, this is easier said than done.
The usual situation is that some teachers manage to meet the needs of
some students, more frequently the bright, conforming, attentive,
white, middle-class students who are seated in the front or middle row.
As the book makes clear, teachers trying to put across the lesson are
not very sensitive detectors of what students know or are thinking.
And, as anyone who has tried it knows, sensitivity training and
behavioral feedback are not invariably effective, even when conducted
by an expert with highly motivated teachers.

Perhaps the statement in the preface of the book that "Despite years
of educational research, relatively little is known about the
characteristics of effective teachers or the behaviors involved in
effective teaching" is too pessimistic in the light of the material covered in
this book. To be sure, research on the characteristics and behavior of
effective teachers should continue. But the problem of effective
teaching appears to be less one of lack of knowledge of what to do and
more a matter of how to do it. Brophy and Good have written a book 
that includes a good summary of research on teacher expectations and 
attitudes and variables associated with them. The book, however, is 
not limited to a description of the what of effective teaching; it also 
makes cogent suggestions as to how it should be done. As such, the 
book deserves serious attention by teachers, student-teachers, and 
educational researchers. Although not a textbook in the strict sense, it 
can be used with profit as associated reading in courses on teaching 
methods, educational psychology, and curriculum. 

LEWIS R. AIKEN, JR. 


Oscar K. Buros (Ed.). Tests in Print II. Highland Park, N. J.:
Gryphon Press, 1974. Pp. xxxix + 1107. $70.00. (No postage
charges on prepaid orders.)


This extremely useful volume is dedicated to the consistent
reviewers of tests, where consistent means contributing reviews to each
of the seven Mental Measurements Yearbooks: Anne Anastasi, Howard
R. Anderson, Walter V. Kaulfers, Victor H. Noll, and Arthur E. Traxler.
Following Oscar Buros' summary closely, the book lists 2,467 tests
in print as of early 1974; 16,574 references through 1971 on specific
tests; a directory of 493 test publishers with complete listing of their
tests; a specific author index for each test with references; a title index
which includes both in-print and out-of-print tests; a comprehensive
cumulative author index to approximately 70,000 documents (tests,
reviews, excerpts, and references) in Tests in Print II, the seven Mental
Measurements Yearbooks, Personality Tests and Reviews, and Reading
Tests and Reviews; and a scanning index for quickly locating tests
designed for a particular population. Also included in the volume is a
reprinting of the 1974 APA-AERA-NCME Standards for Educational
and Psychological Tests.

The "Expanded Table of Contents" presents a complete list of all
categories under which tests have been classified. The numbers cited are
test numbers, not page numbers. For example, 323 pertains to the first
in the list of group intelligence tests while 483 pertains to the first listed
of the individual intelligence tests. The numbers 323 to 482 appear in
brackets above the titles of the group intelligence tests. Such tests are
listed in alphabetical order by title. Above the title College Board
Scholastic Aptitude Test appears [357], immediately followed by a
paragraph describing its function (testing candidates for college
entrance), its dates (1926-73), the acronym SAT, its administration on
specified dates established by the publisher (ETS), its two scores
(verbal and mathematical), and the citations of reviews and references in
Mental Measurements Yearbooks 4, 5, 6, and 7. This is followed by 148
complete and accurate references numbered from 420 to 567 in the
order of their publication. This is succeeded by a "cumulative name
index" with the names of authors in alphabetical order. This enables a
reader to locate quickly a reference from knowledge of the name or
names of its authors. (The same name may pertain to more than one
reference, e.g., Brigham, C. C.: 420-3, 427, 432.)

The PREFACE of Tests in Print II gives a brief history of Tests in
Print I and, along with the chapter headed INTRODUCTION,
presents a history of the seven Mental Measurements Yearbooks, from the
Nineteen Thirty Eight and Nineteen Forty Mental Measurements
Yearbooks to the Seventh Mental Measurements Yearbook, and of the MMY
monographs Reading Tests and Reviews (RTR) and Personality Tests
and Reviews. Seven additional monographs are planned for 1975:
English, foreign languages, intelligence, mathematics, science, social
studies, and vocations. One cannot but be awed by the volume of
business contemplated by Oscar Buros, his small staff, and his
numerous contributors.

In the INTRODUCTION are seven interesting and informative
tables: Table 1 shows the numbers and percentages of tests listed in
TIP I (1961) and in TIP II (1974) as classified by the EXPANDED
TABLE OF CONTENTS inside the front and rear covers of TIP
II. The total numbers of tests are 2,126 (1961) and 2,467 (1974). Table
2 reports the numbers and percentages of the 2,467 tests thus classified
which are new or revised. Table 3 gives the numbers and percentages
of tests classified by countries. In 1974, the United States had 2,204 or
85.3%, and Great Britain 181 or 7%. South Africa, Australia, and Canada
had 64, 53, and 50 tests, or 2.5%, 2.1%, and 1.9%, respectively. Table 4
gives the numbers and percentages of reviews, excerpts, and references
as categorized by the EXPANDED TABLE OF CONTENTS. Table
5 lists the titles of the tests with 100 or more references through 1971.
The Rorschach, the Minnesota Multiphasic (MMPI), the Thematic
Apperception Test, and the Stanford-Binet Intelligence Scale lead with
4,578, 3,855, 1,765, and 1,408 references, respectively. These are closely
followed by the Edwards Personal Preference Schedule, the Strong
Vocational Interest Blank for Men, and the three Wechsler Intelligence
Scales. Table 6 gives the titles of tests with 23 or more references in the
years 1969-71. As noted above, TIP II includes the APA-AERA-
NCME Standards for Educational and Psychological Tests
published in 1974. TIP II concludes with a PUBLISHERS DIRECTORY
AND INDEX, an INDEX OF TITLES, an INDEX OF
NAMES, and a SCANNING INDEX. “THE SCANNING INDEX
will probably be most useful in helping readers locate all tests in a par-
ticular area which are suitable for a given population . . .” (p. [illegible]).
The introductory chapter concludes with advice on how to use TIP II
and specification of the information presented, when available, for
each test. The chapter ends with a statement of the objectives of
Oscar Krisen Buros and his colleagues in The Institute of Mental
Measurements. As the president of Educational Testing Service, Wil-
liam Turnbull, said about him, “If Oscar Buros didn't exist, we would
have to create him.”

Max D. ENGELHART 


Jeremy D. Finn. A General Model for Multivariate Analysis. New
York: Holt, Rinehart and Winston, 1974. Pp. xviii + 589. $10.95. 


Some books remind one of a flock of sheep in a spring meadow: 
constantly leaping off in 20 directions but always, amazingly, a unified 
whole no part of which gets lost. Finn has written just such a book. He 
is concerned with multivariate analysis as a tool for conceptualizing 
and understanding behavioral phenomena. He is equally concerned 
with fitting algebraic models to multiple random (outcome) variables. 
He explores such models in multivariate multiple correlation and 
regression, canonical correlation, principal components, multivariate 
analysis of variance and covariance, and discriminant function 
analysis. He is also involved with formal aspects of estimation and
hypothesis testing, including topics as arcane (from the point of view 
of behavioral scientists) as estimability. He strives to represent all 
data, all transformations, and all computations in standard matrix 
notation. He uses five large-scale, real-life, multivariate (largely 
behavioral) problems to illustrate relevant research areas, exemplify 
computations, and tell the research story from initial conceptualiza- 
tion and design to final inference and interpretation. He discusses 
computer techniques and devotes a lengthy appendix to the crucial 
process that turns stacks of output into readable and accurate results
and discussion in a journal article. Almost as a sideline, Finn teaches 
more matrix algebra than is contained in many mid-level texts. Final- 
ly, he provides timeliness by devoting considerable space to topics 
such as reparameterization of analysis of variance models which until 
now have been poorly or incompletely treated in the literature. 
Surprisingly, this unbelievable potpourri works very well.

The secret of Finn’s success is the discipline he brings to his presen- 
tation. He provides the reader with an overview of the techniques to be 
covered, he discusses in detail the sample problems which will be used 
to exemplify them, and then he sets about providing solutions for the 
problems one step at a time. 

Finn’s first sample problem (based upon data collected by I. Leon
Smith) is a small classic. The topic is the relationship between
creativity and divergent achievement. Divergent achievement is
defined as a quantitative indicator of synthesis and of evaluation as
determined by the tests of Kropp, Stoker, and Bashaw. The major
question asked is, “. . . whether an individual's level of creativity is a
determinant of divergent achievement, and further, whether this con-
tribution represents an effect that cannot more parsimoniously be at-
tributed to general intelligence” (p. 11). Creativity levels are measured
through three of Guilford's subtests: consequences obvious, conse-
quences remote, and possible jobs. Intelligence is measured on the
Lorge-Thorndike Multi-Level Intelligence Test (C-1). The two
hypotheses to be tested are: (1) that the three tests of creativity and the
two of divergent achievement represent a homogeneous set of
cognitive processes; and (2) that levels of creativity determine to a con-
siderable degree an individual's divergent achievement functioning.
The first hypothesis is investigated through principal components
analysis, and the second via multivariate multiple regression. As an
added stroke of sophistication, Torrance's postulate that creativity has
a greater effect when coupled with higher intelligence is
operationalized through three additional independent variables: the
interactions of the intelligence variable with the three measures of
creativity. The inferential part of the multivariate analysis of
variance is conducted through logically ordered, sequential tests, and 
the latter three predictors are therefore relegated to the last positions 
in the predictor vector. 

The other four sample problems are a word-memory experiment, a
study of dental calculus reduction, an essay grading study, and an ex-
periment on programmed instruction effects.

The first step in solving the sample problems is to set them up in
matrix form. To demonstrate this process Finn begins with matrix
notation, progresses through simple operations such as multiplication
and on to more complex ones including orthonormalization and
Kronecker products, and concludes with the Cholesky factoring
procedure and inversion techniques. Related procedures for
calculating determinants, finding characteristic roots and vectors, and
taking derivatives of linear functions are described. There are two
faults with Finn’s presentation of this material. First, he continues to
use vector representation, presumably for the sake of clarity, far past 
the point where matrices would represent not only a more satisfactory 
notational condensation but also a much more satisfying conceptual 
framework. Secondly, several of the computational procedures are 
outlined in a way that keeps them close to the related computer 
routine, but robs them of the inherent simplicity that other algorithms 
would provide. The Cholesky factoring procedure is a case in point. 
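
For readers who have not met the Cholesky procedure, a minimal sketch in plain Python may help. It is not Finn's routine or the MULTIVARIANCE code; the function name `cholesky` and the small matrix `cov` are hypothetical, chosen only for illustration. The idea is to find a lower-triangular factor whose product with its own transpose reproduces a symmetric positive-definite matrix.

```python
import math


def cholesky(a):
    """Return the lower-triangular factor T of a symmetric
    positive-definite matrix a, so that T times T-transpose equals a."""
    n = len(a)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            # Sum of products of the already-computed entries in rows i and j.
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                lower[i][j] = math.sqrt(d)
            else:
                lower[i][j] = (a[i][j] - s) / lower[j][j]
    return lower


# Hypothetical 3 x 3 variance-covariance matrix (illustration only).
cov = [[4.0, 2.0, 0.6],
       [2.0, 2.0, 0.5],
       [0.6, 0.5, 1.0]]

factor = cholesky(cov)
for row in factor:
    print([round(x, 4) for x in row])
```

Each entry depends only on entries already computed to its left and above, which is the simplicity the row-by-row form of the algorithm makes visible.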

Before proceeding to multivariate multiple regression analysis, Finn
pauses to discuss multivariate data summarization procedures. He in-
troduces the expectation and variance operators, variance-covariance
matrices, standardization, the multivariate normal distribution, and
conditional distributions and expectations. Considering the com-
pactness of this section, the treatment is good, although once again it
would probably have benefitted from more matrix representation and
fewer vectors. In this chapter, as he does throughout the book, Finn
sets off illustrative computations in half-tone boxes. Many of these ex-
amples are excellent, but some are so compressed that they lose their
effectiveness.

One very worthwhile example demonstrates clearly the way in which 
failure to take into account mean differences among subgroups can 
profoundly bias correlations among variables. 
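
The effect is easy to reproduce. The sketch below is not Finn's example; the two subgroups and their scores are invented for illustration. Within each subgroup the two variables are uncorrelated, yet pooling the subgroups without removing their mean differences yields a correlation of about .96.

```python
import math


def pearson_r(x, y):
    """Product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical scores: subgroup B is subgroup A shifted up by 10 points
# on both variables, so only the subgroup means differ.
x_a, y_a = [1, 2, 3, 4], [3, 1, 1, 3]
x_b, y_b = [11, 12, 13, 14], [13, 11, 11, 13]

r_a = pearson_r(x_a, y_a)                    # within subgroup A
r_b = pearson_r(x_b, y_b)                    # within subgroup B
r_pooled = pearson_r(x_a + x_b, y_a + y_b)   # subgroups ignored

print(r_a, r_b, round(r_pooled, 3))
```

The pooled coefficient reflects nothing but the distance between the subgroup means, which is why the within-group dispersion is the appropriate basis when subgroups differ in level.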

Chapter 4 deals with multivariate multiple regression analysis. The 
framework of this chapter serves as a blueprint for the treatment of the 
other techniques covered in the book. The univariate model is set 
forth, expressed in matrix terms, and a simple example provided. Next 
the estimation of parameters is discussed and conditions for es- 
timability listed. Dispersion and prediction in the univariate case fol- 
low. The multivariate model is developed in a like sequence. Com- 
putational forms are set forth, and the first sample problem is 
employed to illustrate the technique in detail. Tests of significance are 
presented in the following chapter and exemplified again with the 
creativity-achievement sample problem. The significance tests include 
not only univariate statistics but also the likelihood ratio test and 
various approximations to it. 

Step-down analysis is virtually the only approach to model-build-
ing offered, and, while it provides a satisfyingly rigorous test procedure
and avoids the quicksand of various error-rates, it is a definite depar-
ture from Finn’s otherwise down-to-earth presentation of practical
procedures. The sample problem allows for a logical ordering of in-
dependent variables, but many research problems do not, and step-
down analysis (basically sequential orthogonalization) is virtually
useless when a priori logical ordering cannot be specified.

For those of you who are curious, analysis of the first sample 
problem yields beautifully straightforward results. In this sample of 
observations, only general intelligence plays a significant role in
divergent achievement. There does not seem to be a differential 
creativity effect either across intelligence levels or at any particular 
level. The univariate results show that “. . . synthesis is more affected 
by intelligence than is evaluation. The stepdown statistics indicate that 
in fact the more complex trait, evaluation, does not contribute to the 
association with the predictors. The relationship between the two sets 
of measures is parsimoniously summarized in the correlation (+.64) of
the two simplest constructs, intelligence and synthesis” (p. 17).

Chapter 6 deals with correlation: simple, partial, multiple and 
canonical. A number of useful tests are discussed and several little- 
known but worthwhile procedures, such as the Olkin and Siotani test 
for the correlation of each of two variables with a third, are detailed. 

The very considerable difficulties involved in the interpretation of
canonical correlations are well illustrated. The section on use of prin-
cipal components, however, is much too compressed both conceptual-
ly and mathematically. Moreover, many factor analysts might argue
with a substantial part of Finn’s interpretation.

The material in chapters 7 through 10 on multivariate analysis of
variance and covariance is excellent. There are actually too many good 
features to attempt to list them all. Some of the more interesting and 
useful ones deserve at least brief mention, however. The material on 
the use of Kronecker's delta to generate comparisons has probably 
never been better presented. The treatment of reparameterization is 
very thorough. It includes selection of contrasts, construction of con- 
trast matrices, bases, estimability of interactions, and computations. 
The idea of nesting is expressed simply but correctly. The sample 
problems are apropos and generally well handled. 

The segment in chapter 10 on discriminant function analysis is ab- 
breviated in comparison to the other topics Finn covers, and appears 
to have been added almost as an afterthought. The dental calculus 
reduction study used as an example problem is a less than ideal choice: 
it lacks clarity and emphasis. 

One or two additional points should be mentioned. The book con- 
tains no exercises except for a set to accompany the chapter on matrix 
operations. The bibliography is not bad, but is neither comprehensive 
nor especially representative of the major literature in applied mul- 
tivariate analysis. The very lengthy appendix on the 
MULTIVARIANCE program is useful as a guide to interpreting and 
using complex computer output, but it is unduly bulky: it could have 
been reduced by two-thirds with little sacrifice in utility. As it stands, it 
gives more than a bit of the impression of being a long advertisement 
for a particular canned program. 

In summary, one must conclude that Finn has written an unusual
but uniquely useful book. It does not, of course, provide anything as
ambitious as a formal general model for multivariate statistical
analysis. What it does do is weave the many strands of applied mul-
tivariate procedures, including problem conceptualization, formal
models, matrix representation and computations, and statistical es-
timation and hypothesis testing, together into a single fabric that con-
nects real-life multivariate behavioral research with the best in
analytical procedures. A General Model for Multivariate Analysis will
be of use to many behavioral scientists and should achieve con-
siderable popularity among them.


James A. WALSH
University of Montana 


Gerald L. Isaacs, David E. Christ, Melvin R. Novick and Paul
Jackson. Tables for Bayesian Statisticians. Iowa City, Iowa:
University of Iowa, 1974. Pp. 377. $15.00 (paperback).


This new collection of tables was prepared at the Lindquist Center
for Measurement, The University of Iowa. Its appeal to Bayesian
statisticians arises in the choice of distributions and in the method of
tabulation. There are 20 tables devoted to eight distributions, together
with another 24 tables of related functions and transformations.

Two introductory sections describe the usage of the tables, the nota- 
tion, some definitions, some identities, some remarks on interpolation, 
and some examples. Further references are given to the recent text of 
Novick and Jackson (1974). 

The first tables include the usual reciprocals, squares, cubes, square 
roots and cube roots, positive and negative exponentials, and common 
and natural logarithms. Next are a table of natural logarithms of n- 
factorial and a table of binomial coefficients. These are followed by 
tables for converting degrees to radians and radians to degrees. This is 
done separately for degrees with decimals and degrees with minutes. 
The trigonometric functions consist of the sine, the tangent, and the
square of the sine function, together with tabulations of the inverses of 
these three functions. 

The next two tables are the hyperbolic tangent, and its inverse: the 
hyperbolic arctangent. The latter is the Fisher z-transformation. This 
transformation applied to a correlation coefficient produces a variable 
which may be almost normally distributed. 
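
As a rough illustration of what these two tables tabulate, the sketch below computes the transformation and its inverse with the Python standard library. The function names, the correlation of .50, and the sample size of 50 are hypothetical. The standard error of 1/sqrt(n - 3) is the usual large-sample approximation and is an assumption added here, not something taken from the volume under review.

```python
import math


def fisher_z(r):
    """Fisher z-transformation: the hyperbolic arctangent of r."""
    return math.atanh(r)


def inverse_fisher_z(z):
    """Back-transformation: the hyperbolic tangent of z."""
    return math.tanh(z)


def approximate_interval(r, n, critical=1.96):
    """Approximate interval for a correlation, built on the z scale.

    Assumes the usual large-sample standard error 1 / sqrt(n - 3).
    """
    z = fisher_z(r)
    half_width = critical / math.sqrt(n - 3)
    return inverse_fisher_z(z - half_width), inverse_fisher_z(z + half_width)


r, n = 0.50, 50          # hypothetical sample correlation and sample size
z = fisher_z(r)
low, high = approximate_interval(r, n)
print(round(z, 4), round(low, 3), round(high, 3))
```

Because the interval is symmetric on the z scale and then mapped back through the hyperbolic tangent, it is asymmetric around r and never extends beyond plus or minus one.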

A table of the log-odds transformation, that is


$$\delta = \ln \frac{\pi}{1 - \pi}$$


is given, together with a table of the inverse transformation. If the
prior distribution for parameter $\pi$ is assumed to be Beta, then the
posterior distribution for $\delta$ is nearly normal.
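
A short sketch of the transformation and its inverse follows, using only the Python standard library. The function names `log_odds` and `inverse_log_odds` are hypothetical labels for this illustration, and the proportions are arbitrary.

```python
import math


def log_odds(p):
    """Log-odds of a proportion p, for 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_log_odds(d):
    """Proportion corresponding to a log-odds value d."""
    return 1.0 / (1.0 + math.exp(-d))


for p in (0.10, 0.25, 0.50, 0.75, 0.90):   # arbitrary proportions
    d = log_odds(p)
    print(p, round(d, 4), round(inverse_log_odds(d), 4))
```

The transformation stretches the interval from zero to one onto the whole real line, with a proportion of .50 mapped to zero and complementary proportions mapped to values of equal size and opposite sign.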

The cumulative normal distribution is tabulated, together with a 
table of percentage points (critical points), and with 10 pages of ran-
dom observations (random normal deviates). The Student t distribu-
tion is tabulated in full for degrees of freedom one through 25, 40, and
60. A separate table is given for the percentage points. Sets of 500 
random observations drawn from the Student t distribution are given 
for each of 18 choices of the number of degrees of freedom. 

For the Chi-Square distribution the tabulation is by percentage
points for various degrees of freedom. The 25 percentage points
chosen cover both tails of the distribution as well as the center. A table
is given of the end points of the interval of highest density for various
specified probabilities and for specified degrees of freedom. In addi- 
tion there are sets of random observations. The Inverse Chi-Square 
distribution and the Inverse Chi distribution are listed in three tables 
each, following the format described for the Chi-Square distribution. 

The F distribution is tabulated by percentage points. The degrees of 
freedom for both numerator and denominator range up to 100, while 
the percentage points are chosen to give coverage in the center as well
as in the tails. Next, a short table of Behrens-Fisher percentage points
is given. Finally, there are three lengthy tables devoted to the Beta dis-
tribution. The first gives percentage points, the second gives intervals
of highest density, and the last gives probabilities that one Beta
variable is greater than a second. 
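
The quantity in the last of these tables can be approximated by simulation. The sketch below is not the method used to build the tables; it is a Monte Carlo estimate using the standard library's `random.betavariate`, and the function name, the Beta parameters, the seed, and the number of draws are all hypothetical choices.

```python
import random


def prob_first_beta_greater(a1, b1, a2, b2, draws=20000, seed=1974):
    """Monte Carlo estimate of P(X > Y), where X ~ Beta(a1, b1) and
    Y ~ Beta(a2, b2) are independent."""
    rng = random.Random(seed)   # fixed seed so that runs are repeatable
    wins = 0
    for _ in range(draws):
        if rng.betavariate(a1, b1) > rng.betavariate(a2, b2):
            wins += 1
    return wins / draws


# Two hypothetical posterior distributions for a pair of proportions.
p_different = prob_first_beta_greater(8, 4, 4, 8)
# Identical distributions, for which the probability should be near one half.
p_same = prob_first_beta_greater(5, 5, 5, 5)
print(p_different, p_same)
```

A printed table gives such probabilities exactly for the tabulated parameter values; simulation is offered here only to show what the entries mean.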

A concluding section presents the methods used to construct the 
tables, together with references to the computer subroutines used. 

Within the tables the values of the arguments are closely spaced. In 
most cases a comfortable number of significant digits are provided. 
Though differing in scope from the classic Biometrika Tables and the 
statistical tables of Fisher and Yates, the reviewer feels that the cur-
rent volume may well approach them in importance. It should be
noted that the funding for this extensive project was through the
generosity of the Iowa Testing Programs of the University of Iowa.


David A. Payne. The Assessment of Learning: Cognitive and Affec-
tive. Lexington, Mass.: D. C. Heath and Company, 1974. Pp. 524.
$9.95. 


In the words of its author, this book was written “to provide the 
ever-increasing number of classroom teachers and professional 
evaluators a practical and efficient set of techniques to aid in evaluat-
ing learning outcomes.” From the preface it is assumed that this aim
will be realized primarily by catching these two groups of individuals 
when they are undergraduate or graduate students in courses on “test 
construction or evaluative research methods." 

The author of this book is Professor of Educational Psychology at
the University of Georgia and Review Editor of the Journal of
Educational Measurement. The book, a revision and expansion of his
earlier paperback volume (Payne, 1968), is chock-full of facts and
methods pertaining to educational assessment. Within its 524 pages
are 18 chapters, several appendices including some useful statistical
tables, a glossary of terms, and the usual prefatory and index material.
The chapters range in length from 12 to 46 pages, and are divided into
six sections: I. An Overview (Chapt. 1), II. Planning for Instrument
Development (Chapts. 2-4), III. Instrument Development (Chapts.
5-8), IV. Summarizing and Interpreting Test Performances (Chapts.
9-10), V. Instrument Refinement (Chapts. 11-12), VI. Other Sources
and Uses of Assessment Data (Chapts. 13-18).

Cognitive objectives of learning are emphasized in the book, but
somewhat unusual and timely is the comprehensive treatment of affec-
tive and psychomotor objectives. For example, five separate chapters
are devoted to the specification and measurement of the affective out-
comes of education. The book possesses several other unusual
features, but whether or not they are meritorious depends on one's
point of view. It could be argued that the design and style of the
chapters (summaries at the beginning rather than the end; many lists,
tables, flowcharts, etc.; technically competent but cumbersome and
unflowing prose) is too reminiscent of journal articles in APA format.
Personally, I found the emphasis on structure and handbook-like
detail somewhat unstimulating and distracting at times. Whether
fledgling students who frequently go through statistically-oriented
courses in a hazy-dazy condition will appreciate this style is guess-
work. Students do not invariably prefer the same textbooks as their
professors, but it is usually wise to err on the simple side!

Even with its shortcomings, there are numerous positive features to 
this book. Chapters 2 and 3 present a thorough coverage of 
educational objectives, Chapter 5 contains many helpful illustrations 
of good and poor item writing, and Chapter 8 gives excellent descrip- 
tions of self-report affective items and inventories. Chapter 10 on test 
interpretation, Chapter 13 on criterion-referenced measures, and 
Chapter 16 on the assessment of affective, performance, and product 
outcomes by direct observation are also noteworthy. I was also
delighted to see a glossary in the appendix. 

On the other hand, rather than taking up so much space in Chapter 
14 with critical reviews of standardized achievement tests, it would 
probably have been more helpful to harassed teachers if the author 
had simply given his recommendations as to which tests are most ap-
propriate for particular situations and some comments on the uses of
standardized achievement tests in the schools. Furthermore, I cannot
imagine that prospective teachers would be interested in all of the
details of specific instruments presented in Chapter 15.

Among the topics receiving little or no attention, but deserving
more, are the use of tests in accountability and performance contract-
ing, gain scores, ethical and ethnical issues, the use of tests in prescrip-
tive teaching as well as diagnosis of learning difficulties, and formative
vs. summative evaluation. Chapter 9 is a good overview of statistics
for testing, and unlike some authors this one did not confuse the defi-
nitions of percentile and percentile rank. Unfortunately, he was incon-
sistent in his definitions of variance and standard deviation: in the
summary it's N, and in the chapter it’s N-1 in the denominator. A set
of problems and exercises, especially in the more statistical chapters,
would also have been nice.
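
The practical consequence of the two denominators is easy to see with a small set of scores. The sketch below uses the Python standard library, in which `statistics.pvariance` divides by N and `statistics.variance` divides by N-1; the eight scores are hypothetical.

```python
import math
import statistics

scores = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical test scores, N = 8

# The sum of squared deviations from the mean of 5 is 32.
var_n = statistics.pvariance(scores)          # 32 / 8: denominator N
var_n_minus_1 = statistics.variance(scores)   # 32 / 7: denominator N - 1

sd_n = math.sqrt(var_n)
sd_n_minus_1 = math.sqrt(var_n_minus_1)

print(var_n, round(var_n_minus_1, 3))
print(sd_n, round(sd_n_minus_1, 3))
```

With only eight scores the two standard deviations differ by about seven percent, so a student working from the summary and another working from the chapter would not obtain the same answer.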

Basically, David Payne has written a combination “how to do it”
and reference book on assessment and evaluation for educators. He
has done a good job of representing the current state of the art in
educational testing. The book will serve well as a handbook or
resource book for teachers and professional evaluators, and should
also have its share of adoptions for courses on “educational,” if not
“psychological,” measurement and testing. In designing a textbook
for a first course in educational testing, however, the author and editor
would have fared better if they had given more thought to
motivational features. As the author would undoubtedly agree, affect,
as well as cognition, must be taken into account by educators of every
stripe, including textbook writers.


David A. Payne and Robert F. McMorris (Eds.). Educational and
Psychological Measurement, Contributions to Theory and Practice.
(2nd. ed.) Morristown, N. J.: General Learning Corporation,
1975. Pp. xx + 397. $6.50 (paperback).


The first edition of this outstanding anthology provided students
and teachers “convenient and readily accessible summaries of concern
to the undergraduate or graduate student of educational and psy-
chological measurement.” Comparison of both editions reveals that
the editors have made numerous changes from the first edition to the
second. Twenty-eight of the quoted selections are new while eighteen
have been retained from the first edition. A few of the selections new to
this anthology actually appeared years ago. These include the late
Percival Symonds’s “Factors Influencing Test Reliability” (1928),
Stephen Corey's “Measuring Attitudes in the Classroom” (1943), and
Anne Anastasi's “Some Implications of Cultural Factors for Test
Construction” (1949).

Among the major contributions appearing in both editions are
“Measurement and the Teacher,” “Evaluating Content Validity,” and
“The Social Consequences of Educational Testing” by Robert L.
Ebel; “Response Sets and Test Design” by Lee J. Cronbach; “Con-
vergent and Discriminant Validity” by Donald T. Campbell and
Donald W. Fiske; “Guidelines for Testing Minority Group Children”
by Martin Deutsch and others; this reviewer's “Suggestions for
Writing Achievement Test Exercises”; and “Non-Apparent Limita-
tions of Normative Data” by Junius A. Davis. Among the influential
selections appearing in the second edition only are “The Validation of
Educational Measures” and “Course Improvement Through
Evaluation” by Lee J. Cronbach, “Concepts of Achievement and
Proficiency” by William E. Coffman, and “Implications of Criterion-
Referenced Measurement” by W. James Popham and T. R. Husek.
Also influential and timely are “Testing for Accountability” by Ralph
W. Tyler, “Testing Hazards in Performance Contracting” by Robert
E. Stake; “Expectancy Tables—a Way of Interpreting Test Validity” 
by Alexander G. Wesman. 

Other commendable features of the second edition of this anthology 
are a list of 24 references, “Current Textbooks in Tests, Measurement,
and Evaluation,” and a “Textbook Reference Chart” which correlates the
chapters or selected articles in the second edition of this anthology 
with the relevant chapters of the 24 textbooks. The anthology con- 
cludes with a bibliography of 334 references and an excellent glossary 
of 148 measurement terms. 

In concluding this generally favorable review it seems unfortunate 
that most of its title so closely resembles that of our journal. I suspect
that the publisher, rather than the author, is to be credited with the
grotesque cover design, more suitable to the planet of the apes.

Max D. ENGELHART 


Herbert J. Walberg (Ed.). Evaluating Educational Performance: A
Sourcebook of Methods, Instruments, and Examples. Berkeley,
California: McCutchan Publishing Corporation, 1974. Pp. xxii +
395. $12.00. 


This collection of 19 papers addresses a variety of topics within the 
program evaluation/accountability realm of educational research. The 
editor's stated goal is to provide a practical orientation to evaluation 
of educational system effectiveness as a basis for policy formulation 
and decision making. The perspective is the macro-level of analysis
with the school as subject. While the editor is the principal contributor
(author or co-author of 7 chapters) his influence is even more per-
vasive. The majority of the contributors are or have been affiliated
with Chicago Circle, Chicago, Northwestern, Illinois, or Wisconsin
and many chapters report research conducted in the Chicago public
schools. Three of the chapters are reprinted and several others are
based on previously delivered presentations. In chapter 1 Walberg
gives a brief explanation of the purpose of the book followed by over-
views of the remaining chapters.

Two-thirds of chapter 2 by Glass is devoted to an extended critique
of the PMM (Popham-McNeil-Millman) method of teacher evalua-
tion; the primary deficiency is low reliability, which also rules out
standardized testing of pupils as a method of assessing teacher effec-
tiveness. In chapter 3 Brophy outlines an intensive (over 1000
variables) investigation of teacher behaviors that may produce student
achievement (as reflected in residual gain scores derived from standard-
ized tests). Tentative results suggest that teachers' managerial skills,
student attention level, and uncrowdedness of room are related to
pupil achievement.

The Barclay Classroom Climate Inventory (BCCI), a computer-
scored, diagnostic instrument which assesses pupil deficits in [illegible]
areas, is described by its author in chapter 4. In chapter 5 Nielsen and
Kirk provide brief descriptions of five observation instruments and
five self-report questionnaires for assessing classroom climates. Their
review of relevant studies revealed no clear-cut relationship between
climates and student achievement. This conclusion is seemingly con-
tradicted by Anderson and Walberg in the following chapter, who
concluded that the Learning Environment Inventory (a 15-scale, self-
report measure of the “high inference” variety) accounted for substan-
tial variance in measures of student learning. Are classroom climates
and learning environments really different, or is the LEI simply a bet-
ter instrument than the others?

In Chapter 7 Johnson describes the Minnesota School Affect Assess-
ment (MSAA), an instrument designed to evaluate the effectiveness of
schools in meeting educational objectives in the affective domain.
Chapter 8 by Welch and Walberg, which is reprinted from AERJ, is a
summary of a nation-wide, experimental evaluation of Project Physics.
While there were no differences on the cognitive criteria, Project
Physics students did report more favorable reactions to the non-
cognitive aspects of the course. In another reprinted chapter, [illegible]
describes an instrument for the assessment of instructional materials;
reliability was low in the field test, however.

Van Hove, Coleman, Rabben, and Karweit summarize
analyses of routinely published achievement test data and two indices
of racial integration for elementary school children in six
American cities in chapter 10. They did not find a consistent integra-
tion effect, although there were specific effects within cities and
levels. Chapter 11 by Jensen (reprinted) presents a detailed report of
his investigation of the determinants of white/nonwhite differences in
scholastic achievement in a California elementary school district.
When the ethnic-racial groups were statistically equated on
[illegible] and ability variables, differences in achievement disap-
peared.

Chapters 12 through 16 summarize the results of a series of studies
conducted in the Chicago public schools by Walberg and his as-
sociates. Walberg and Bargen report a limited study of educational
equality in chapter 12. While attendance patterns were
segregated with minority schools evidencing lower achievement and
inferior teachers, expenditures were fairly equal across schools. In the
next chapter the same authors present the results of multiple regres-
sion analyses to explain variability in reading achievement at [illegible]
grade levels. Not unexpectedly, earlier achievement accounted for
most of the variance in later achievement. The regression equations
were used to select two high- and two low-achieving elementary
schools. In chapter 14 Talmage and Rippey discuss the results of
observations of these four schools: their findings were generally not
consistent with previous studies. Powell and Eash report a similar 
study of two high- and two low-achieving high schools in chapter 15. 
While the anecdotal descriptions in both chapters are interesting, the 
methodology is suspect on two counts: (1) the criterion (adjusted gain 
in reading achievement) is a statistical extrapolation, and (2) the sam- 
ples are too small for anything except a pilot study. The last Chicago 
study by Coughlan and Cooke used the same methodology as the 
previous studies. A work attitudes survey was completed by teachers 
in high- and low-performance elementary schools and the results were 
discussed. 

The next three chapters outline recently developed graphical 
methods which may be used to summarize geographically-related 
data. In chapter 17 McIsaac reviews trend surface analysis and gives
three examples of applications. Spuck provides an overview of 
geocode analysis, a procedure for summarizing student data. Both 
trend surface and geocode analysis summarize data in the form of 
"contour maps." In chapter 19 Walberg and Bargen describe a regres- 
sion analysis of the Chicago elementary school data using three spatial 
models (concentric, status, and sector) to explain differences among 
schools. 

The final chapter by Walberg (a 1972 AAAS presentation) is an 
interesting, if not somewhat rambling, review of a variety of issues 
which relate to the future of education, e.g., genetic versus environ- 
mental explanations of behavior, increased intelligence in the popula- 
tion, social and economic correlates of intellectual development, 
effects of education on basic abilities, etc. He concludes the chapter by 
questioning the reliance on standardized tests as the most important 
measures of educational outcomes, suggesting that they "probably do 
far more harm than good.” Ironically, Walberg ends the book express- 
ing a viewpoint that undermines the validity of the principal criterion 
employed in many of the previous 19 chapters.

In summary, the volume contains a variety of review articles and
research studies which summarize instruments and findings and
illustrate various research strategies and methodologies in educational
program evaluation. The collection is uneven in style and quality, but
this is unavoidable with contributed books. The value of the volume
as a sourcebook is reduced due to the absence of a subject index.

BRIAN BOLTON 
University of Arkansas


RANDOM VARIABLES AND CORRELATIONAL OVERKILL


JOSEPH T. KUNCE AND DANIEL W. COOK 
University of Missouri-Columbia 


DOUGLAS E. MILLER 
Michigan State University 


Research findings may be more publishable if significant results 
are reported. This type of publication bias would increase the 
likelihood of “chance” relationships being disseminated. The im- 
plications of these assumptions are empirically investigated in a cor- 
relational analogue study. A large number of significant 
relationships were found in several groups of subjects between their 
actual scores on 45 SVIB scales and scores on 10 “experimental” 
scales which were determined by a set of random numbers.
Furthermore, “logical” factors were shown to underlie relationships which
existed among scores on a given random scale with its significant
correlations to SVIB scales. Considerations in such overkill in simple
correlational studies are the subject-to-variable ratio, variable in- 
dependence, and more stringent probability levels. 


How much significance should one place in published “significant
findings?” Bakan (1967), Bozarth and Roberts (1972), and Cohen
(1962) have identified a publication bias that favors acceptance of
research studies showing significant findings almost exclusively over
nonsignificant results. Such an occurrence may create a situation
where journals are “replete with research reports that have resulted in
significant findings by chance factors alone (Way and Larrimore, 1973,
p. 362).” Another factor affecting significant findings is statistical
overkill (Kunce, 1971). Here the use of multivariate procedures (e.g.,
multiple correlations, factor analyses) and a subject-variable ratio less
than 10 to 1 produces results that can be anticipated not to generalize
to other similar populations (Miller and Kunce, 1973). The purpose of
the present investigation is to design an analogue study to investigate
the incidence and implications of chance correlations upon
publication bias and “overkill” in a simple, zero-order correlational design.


Method 
Subjects 


Strong Vocational Interest Blank test scores (Form T-399) were 
available from a follow-up study of graduates from the University of 
Missouri-Columbia (Kunce, Dolliver, and Irwin, 1972). Of these sub- 
jects, 163 had participated fully in the research and 57 had not. 
Complete SVIB data obtained in college, however, were available for 
all 220 subjects on 45 of the SVIB scales. The 220 subjects were divided 
into four groups of 55 each. Group A consisted of 55 who had not 
fully participated in the follow-up study and Groups B, C, and D each 
had 55 subjects randomly assigned from the remaining 165. 


Procedure 


Each subject's data record consisted of his 45 SVIB scale scores plus
scores from 10 new "experimental" scales. The scores of these
experimental scales were, in reality, values extracted from a table of
random numbers. Therefore, each subject had scores on a total of 55
variables.

All of the subjects’ random scores on experimental scale #1 were
correlated with their actual scores on each of the 45 SVIB scales for a
total of 45 correlations. This procedure was repeated for each of the
remaining nine experimental scales with the 45 SVIB scales. The 450
resulting correlations were computed separately for Groups A, B, C,
and D and for all groups combined. For the separate groups the ratio
of subjects to variables was 1:1 (55 to 55) and for the combined group
4:1 (220 to 55).


Results 


The number of correlations significant at or above the .05 level
(2-tailed test) obtained for Groups A, B, C, and D and for the combined
groups are presented in Table 1. Altogether, 17 of the intercorrelations
for the 10 experimental scales with the 45 SVIB scales were significant
for Group A; 26 for B; 34 for C; and 42 for D. Therefore, Groups B, C,
and D had a total number of significant correlations higher than that
expected by chance (i.e., assuming 23, or 5%, of the 450
intercorrelations per group would be significant). The number of
significant correlations (n = 22) obtained for the four combined groups
was essentially identical to that expected using a 5% chance
significance rate.


TABLE 1
Frequency of Significant Correlations for Each Random Scale with 45 SVIB Scales

                              Group
Random       A          B          C          D         All
scale     (N = 55)   (N = 55)   (N = 55)   (N = 55)   (N = 220)

   1          0          1          0          0          0
   2          4          1          0         12          3
   3          4          4          3         11          4
   4          2          2          0          8          7
   5          4          0          8          7          1
   6          0          1          6          2          3
   7          0          7          0          1          0
   8          0          3          0          0          1
   9          1          0          6          0          2
  10          2          7         11          1          1
TOTALS       17         26         34         42         22
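
The counting procedure described above can be imitated in a few lines. The following is only a minimal sketch under stated assumptions, not the authors' procedure: the 45 SVIB scales are replaced by independent normal scores (real SVIB scales are intercorrelated), the function names are hypothetical, and the two-tailed .05 cutoff for r is obtained from the Fisher z approximation because the article does not say how significance was determined.

```python
import math
import random
from statistics import NormalDist


def pearson_r(x, y):
    """Plain Pearson correlation for two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def critical_r(n, alpha=0.05):
    """Two-tailed critical |r| from the Fisher z approximation (assumption)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.tanh(z / math.sqrt(n - 3))


def count_chance_hits(n_subjects=55, n_real=45, n_random=10, seed=0):
    """Count 'significant' correlations per random scale in one simulated group."""
    rng = random.Random(seed)
    # Hypothetical stand-in for the 45 SVIB scales: independent normal scores.
    real = [[rng.gauss(0, 1) for _ in range(n_subjects)] for _ in range(n_real)]
    # The 10 "experimental" scales are pure random numbers.
    fake = [[rng.gauss(0, 1) for _ in range(n_subjects)] for _ in range(n_random)]
    cutoff = critical_r(n_subjects)
    return [sum(abs(pearson_r(f, r)) >= cutoff for r in real) for f in fake]


hits = count_chance_hits()
print(hits, sum(hits))  # per-scale counts and the total for one group of 55
```

Changing `seed` plays the role of a different "researcher"; with independent stand-in scales the totals scatter around 5% of 450, whereas intercorrelated scales would be expected to make the per-scale counts clump more than this sketch shows.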


Discussion 


Assume that these data from Groups A, B, C, and D represent actual
findings from each of four researchers who independently had
evaluated real, not random, scores on 10 experimental scales. Which
of the researchers, then, would have the best chances of having his
results accepted for publication in a journal? The results of the
findings for Groups C and D, and to a lesser extent B, would appear to
have the best chance of being considered favorably because of the
relatively large number of significant correlations. The acceptance
would depend, additionally, upon the “logicalness” of the relationships
obtained. To explore this issue, the statistically significant
intercorrelations obtained between scores on experimental scale #10 and
the SVIB scales for Group C (see Table 1) were arbitrarily examined.
These correlations were as follows:


    [scale name illegible]          +.26
    [scale name illegible]          +.29
    [scale name illegible]          +.29
    [scale name illegible]          +.31
    Personnel Director              -.40
    Public Administrator            -.38
    Social Science Teacher          -.33
    City School Superintendent      [value illegible]
    Minister                        -.28
    Mortician                       -.28
    Life Insurance                  [value illegible]


From the nature of these relationships one could infer that experimen- 
tal scale #10 represents a "things-people" dimension. Subsequent 
analysis of other random scales showed similar dimensions. 

The seemingly logical relationship of scores on random scales with 
SVIB scales may be a consequence of the fact that the SVIB scales 
themselves are not independent of each other. In the above example 
the majority of the correlations between architect, dentist, physicist, 
and farmer with the other scales could be expected to be negative. 
Other research studies using similar measures that lack scale
independence could, likewise, generate a “multiplication effect" of
chance relationships.

Ironically, the results of our hypothetical researchers C and D
would appear to have the best chance of getting published and
promulgating “false findings,” whereas the data from Group A, which
may contain “truer” results (i.e., within the 5% chance expectancy
range), are not likely to get published. The probability of false
results being generated and disseminated would likely be reduced if a
higher subject/variable ratio is used. For example, for the combined
group of subjects having a subject/variable ratio of 220:55, the number
of statistically significant correlations, N = 22, did not exceed that
expected by chance alone. Even here caution should be exerted with
regard to individual scales and their “validity.” Experimental scale #4
(see Table 1) for the combined groups correlated significantly with
seven of the 45 SVIB scales. This could be interpreted as three times
the chance level, assuming that 5% of 45 (or approximately 2)
correlations could be anticipated to be significant.
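
The arithmetic behind "three times the chance level" can be spelled out as follows. This is a sketch only: the binomial tail treats the 45 tests as independent, which the SVIB scales are not, so the tail probability is an illustration of the calculation rather than a valid significance level for scale #4. The helper name `binom_tail` is hypothetical.

```python
from math import comb


def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


n_scales, alpha = 45, 0.05
expected = n_scales * alpha             # 2.25 chance "hits" per random scale
ratio = 7 / expected                    # observed count relative to chance
tail = binom_tail(7, n_scales, alpha)   # 7 or more hits, if tests were independent
print(expected, round(ratio, 1), tail)
```

Because the scales overlap, significant correlations tend to arrive in clusters, so a count of seven is less surprising than the independent-tests figure suggests.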

The results of this study support the position that many studies
accepted for publication may have largely chance findings in spite of
their reported statistical significance. The face validity of chance
significant findings resulting from scale interdependence further masks
the true relationship. Several considerations may reduce premature
acceptance of chance findings in simple correlational studies:


1. Increase the number of subjects in relationship to the number of
variables to at least 10:1 as in multivariate studies.


2. Use more stringent probability levels when low subject/variable
ratios are unavoidable.


3. Use independent validity generalization samples before
publishing data based on low subject/variable ratios.

We are not in a position, nor have the expertise, to determine how
many significant correlations are needed in a correlational table to be 
significantly greater than chance when scales are not independent. Un- 
doubtedly, this determination could be a complex procedure and 
would vary according to the degree of interscale independence. The 
computational complexities involved or the rigid interpretation of 
guidelines could lead to suppression of new and potentially useful psy- 
chological findings. 

Greater appreciation and awareness of the realities of statistical and 
correlational overkill, rather than complex mathematics, may be suffi- 
cient to reduce premature acceptance of superficially impressive 
findings. This conclusion is exemplified by the recommendations of 
other investigators. Tversky and Kahneman (1971) cautioned against 
the expectation that findings from small samples will generalize to 
other small replication samples even if both are thought to be 
representative of the population. Furthermore, if replication samples 
are smaller than the developmental sample, the power of the test is 
diminished, which reduces the chance of gaining significant results.
Lykken (1968) believes that significance level is the least important
attribute of a good experiment and is not sufficient for indicating
whether a theory has been corroborated, an empirical fact confidently
established, or whether a study should be published. He feels all
experiments ideally should be replicated before publication. And,
finally, Winer (1971, p. 14) has taken the position that "no absolute
standard can be set up for determining the appropriate level of
significance and power that a test should have. The level of significance
used in making statistical tests should be gauged in part by the power
of practically important alternative hypotheses at varying levels of
significance.”


INDEPENDENCE PROBLEMS FOR CERTAIN TESTS
BASED ON THE SHINE-BOWER ERROR TERM 


LESTER C. SHINE II
Texas A & M University 


It is shown for the Shine-Bower single-subject ANOVA that the 
numerator and denominator of all F tests based on the Shine-Bower 
error term are independent of each other. It is also shown that the 
same property holds for all such tests in the Shine Combined 
ANOVA except for the test for the trial by subject interaction. 


This article is intended to clarify the independence of the numerator
and denominator of certain F tests which use the Shine-Bower error 
term (MSE’) as a denominator. These tests occur in the Shine-Bower 
single-subject ANOVA (Shine and Bower, 1971; Shine, 1973b) and in 
the Shine Combined ANOVA (Shine, 1973a, 1974). It has been in- 
dicated by Shine (1974, footnote on p. 50) that a lack of independence 
occurs only in the Combined ANOVA test for an interaction between 
trials and subjects. 


Shine-Bower Single-Subject ANOVA (General Case) 


MSE' is used in this fixed factor design to test all sources of variation
except the main effect of trials. It is assumed for MSE' that the main
effect of trials changes slowly across trials. Independent additive
contributions to MSE' are obtained by squaring and summing every other
successive difference between trial means (an even number of trials is
assumed), after collapsing across all other factors. These successive
differences themselves may be expressed as successive differences
between the standard linear forms representing trial main effects
(Scheffé, 1959). This set of trial main effect linear forms is orthogonal
to any of the sets of standard linear forms on which mean squares for
the other sources of variation in the design are based (Scheffé, 1959).
Consequently, it follows immediately from the normality and
independence assumptions of the Shine-Bower ANOVA that MSE' is
statistically independent of all mean squares in the design except for
the trial main effect mean square.


Shine Combined ANOVA (General Case) 


This design is effectively a standard repeated measures design in 
which MSE' is used only for testing subject sources of variation. MSE'
is obtained by pooling across suitable subjects the corresponding sums
of squares and degrees of freedom associated with each individual
subject's Shine-Bower error term. MSE' is clearly a function of changes
across trials within subjects and within the nonrepeated factors, after
collapsing across all repeated factors. Thus, MSE' must be a function
of the standard linear forms for the main effect of trials, for all inter- 
actions involving only trials and one or more of the nonrepeated 
factors, and for the nested trial by subject interaction (the subject 
factor is nested in the nonrepeated factors). In the sense stated above, 
these forms are orthogonal to all other sources of variation in the 
design (Scheffé, 1959). Consequently, under the normality assump- 
tion and under the usual homogeneity assumptions for variances and 
covariances (Winer, 1971), MSE' is statistically independent of all 
subject effect mean squares except for the nested trial by subject inter- 
action mean square. 


Conclusion 


Only the F test in the Combined ANOVA for testing the nested trial
by subject interaction (non-nested for the special case of a completely
repeated design) presents a problem regarding the independence of the
numerator and denominator (MSE'). This conclusion holds whether or
not any null hypotheses are true and whether or not the slow change
assumption for MSE' is met. In any case, the F test for the trial by
subject interaction must be regarded as very approximate in nature. The
author is currently investigating the exact distribution of this
statistic.


ON THE INDEPENDENCE OF VARIABLE SETS


CHARLES D. DZIUBAN, EDWIN C. SHIRKEY, AND THOMAS O. PEEPLES
Florida Technological University 


An illustration of a test for independence was provided with a 
mixed set of variables. The matrix consisted of 10 tests of interest 
and four random deviates in which the relationship between sets was 
demonstrated to be minimal. The result was discussed for a situation 
in which factoring methods might be considered. 


RECENTLY, Shaycoft (1970) presented a clear example in which
results yielded by the method of principal components would lead to
an erroneous interpretation. The procedure was applied to a “mixed”
set of variables: ten measures of interest from project TALENT and
four random deviates (N = 3689). The obtained results would have
forced one to attempt an interpretation of random variables as the
basis of a meaningful component.

Dziuban and Harris (1973) illustrated that such a result would be
guarded against by application of the image and factor analytic
models. They noted that the random variables did not warrant any
interpretation when those procedures were applied to the “mixed” set.
Their recommendations included abandonment of principal
components in favor of factor analytic and image methods when one has
reason to believe that some variables in a set are essentially random.

In a case such as the one presented by Shaycoft, it might be expected
that the ten project TALENT and four random variables, the two sets,
were independent of each other. That hypothesis might be tested using
the Heck largest root distribution (Morrison, 1967). If the largest
eigenvalue of the matrix $R_{11}^{-1} R_{12} R_{22}^{-1} R_{21}$ exceeds the
100$\alpha$ percentage point of the distribution, the independence
hypothesis may be rejected.
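
One way to read this test statistic, stated here as a standard result of canonical correlation analysis rather than something spelled out in the note, is that the nonzero eigenvalues of the product matrix are the squared canonical correlations between the two variable sets. Under that reading (with $R_{11}$ and $R_{22}$ the within-set correlation matrices and $R_{12}$ the between-set block), the largest root and the largest canonical correlation reported later in this note agree with each other:

```latex
\lambda_{\max}\!\left(R_{11}^{-1} R_{12} R_{22}^{-1} R_{21}\right) = r_{c,\max}^{2},
\qquad R_{21} = R_{12}^{\top},
\qquad (0.059)^{2} \approx 0.0035 .
```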


In this case $R_{11}$ comprised the correlations among the TALENT

TABLE 1
The Matrix $R_{11}^{-1} R_{12} R_{22}^{-1} R_{21}$


1 2 3 4 5 

5.2.x 10-5 19 x 10-4 6.1 X 1074 27X10* — -40x10-* 
=10 x 10-8 1.1 x 107 L5x10* -51х10- —43 X 10-* 
—5.5Х 10-4 64 x 10-* 101052  —-26x10*  -38x10-* 
1.2 х 10-* 1.5 x 10-5 50 x 10-4 80 x 10-* —44 X 107* 
2.5 X 10-8 2.6 X 1075 8.2 X 10-8 74X10* — —68x10- 
—6.1 X 10-4 50 x 10-4 11x103 -75Х104 — -4.4x 10-4 
=1.2 x 10-3 1.1 X 10-8 L6Xx10* — —99x10*  -45xl10-* 
—3.7 x 10-4 3.2 x 10-4 6.1Х 10-4  -57x10*  -19x10-* 
о S 0 4.1 X 10-4 44Х104 -10х10- 3.8 x 107° 
-22х10- 2.7 x 10-4 5.2Х 10*  -33x10* —91Х10-—* 

6 7 8 9 10 

9.0 x 1075 3.3 x 10-4 5.2 x 10-4 —5.5 X 10-4 2.0 x 107* 
1.8 x 10-4 1.5Х 10-8 6.2 X 10-4 —6.5 X 10-4 —6.3 X 10-* 
-22 X 10-4 6.6 X 10-4 T3x10* — -32х10- 7.0 x 10-4 
729 x 10-4 1.6 X 10-5 $1x10* -61х10- 5.2 x 10-* 
-22X10* -11х10- 1.7 X 107* 1.4 X 107^ 5:3 x 10-4 
8.3 x 10-4 1.2 x 107* 7.0 x 107* -6.0 x 10-* —5.0 X 107* 
54x 10-‹ 18 x 10-8 73Х104 — -63x10* -7.8 X 10-* 
-70X10*  -39х10- 5.6 x 10-4 1.4 x 107° 8.4 x 107* 
—3.8 X 107* 2.6 X 107* 4.2 X 10-4 50 x 10-4 1.2 X 107° 
-52X10* 30 x 1075 6.8 X 10-4 1.2x 10-* 1.6 X 107? 


variables, $R_{22}$ those among the random variables, and $R_{12}$ their
intercorrelations. Referring to the largest root distribution, the value
needed to reject the hypothesis ($\alpha$ = .01) is approximately .025. The
largest obtained eigenvalue of the matrix was .0035 so that the
independence hypothesis would not be rejected.¹ This result suggests
that if one has reason to suspect that some variables in a set are
unrelated to the domain of interest, assessment might be made prior to
any “factoring” procedures.

¹ Largest canonical correlation = .059 ($\chi^2$ = 31.96, D.F. = 40).


SAMPLING CHARACTERISTICS OF KELLEY'S
$\epsilon^2$ AND HAYS’ $\hat{\omega}^2$


ROBERT M. CARROLL 
U.S. Army Research Institute for the Behavioral and Social Sciences 


LENA A. NORDHOLM 
Indiana University Northwest¹


Statistics used to estimate the population correlation ratio were
reviewed and evaluated. The sampling distributions of Kelley's $\epsilon^2$
and Hays’ $\hat{\omega}^2$ were studied empirically by computer simulation
within the context of a three level one-way fixed effects analysis of
variance design. These statistics were found to have rather large
standard errors when small samples were used. As with other correlation
indices, large samples are recommended for accuracy of estimation.
Both $\epsilon^2$ and $\hat{\omega}^2$ were found to be negligibly biased.
Heterogeneity of variances had negligible effects on the estimates under
conditions of proportional representativeness of sample sizes with
respect to their population counterparts, but combinations of
heterogeneity of variance and unrepresentative sample sizes yielded
especially poor estimates.


IN spite of the consistent emphasis on p-levels by editors of
psychological journals, some researchers have concerned themselves with
the question of the strength of relationship between the independent
and dependent variables in comparative experiments. This concern is
justified and should be encouraged, since the emphasis on p-levels
alone may lead to exploitation in the form of reporting trivial effects
due to large sample sizes. It is questionable, however, whether the need
for “practical significance” is best served by estimates of the strength
of the relationship such as the correlation ratio computed on a sample
or some of the other estimators of the population correlation ratio as
given by Pearson (1923), Kelley (1935), or Hays (1963).

¹ This study was supported by a grant for computer time from the Computer Science
Center, University of Maryland. An earlier draft of this paper was read at the
Midwestern Psychological Association Meeting at Cleveland, Ohio in May 1972. This
study was conducted while the authors were at the University of Maryland.

Originally, a measure known as the correlation ratio or $\eta^2$ (eta
squared) was developed to give a descriptive index of the total
relationship in a single factor fixed effect design, in contrast to $r^2$,
which only indicates linear relationship. In the analysis of variance
notation $\eta^2$ was defined as:

$$\eta^2 = \frac{SS_T - SS_W}{SS_T} = \frac{SS_B}{SS_T} \tag{1}$$

where:

$SS_T$ = total sum of squares.
$SS_W$ = sum of squares within groups.
$SS_B$ = sum of squares between groups.

The population value of $\eta^2$ was similarly defined by substituting
population sums of squares for their sample counterparts, which leads to
the formula:

$$\eta^2_{pop} = \frac{\sigma_Y^2 - \sigma_e^2}{\sigma_Y^2} \tag{2}$$

where:

$\sigma_Y^2$ = the variance of the dependent variable.
$\sigma_e^2$ = the common homogeneous variance of the $Y_{ij}$ about $\mu_j$.


However, recognizing that $\eta^2$ (Formula 1) is not an unbiased
estimate of the correlation ratio in the population, Pearson (1923)
suggested an improved (less biased) large sample approximation of the
correlation ratio in the population. His estimate was:

$$\hat{\eta}^2 = \frac{\eta_o^2 - (J - 3)/N}{1 - (J - 3)/N} \tag{3}$$

where:

$\hat{\eta}^2$ = estimate of population correlation ratio.
$\eta_o^2$ = observed correlation ratio on sample.
J = number of data arrays.
N = total sample size.


Kelley (1935) proposed another estimate of the population correlation
ratio which he believed to be unbiased. This statistic, which he called
$\epsilon^2$ (epsilon squared), was formulated by substituting what Kelley
thought were unbiased estimators for $\sigma_Y^2$ and $\sigma_e^2$ in
Equation 2. His estimate of the population correlation ratio so
determined was:

$$\epsilon^2 = \frac{SS_T/(N - 1) - SS_W/(N - J)}{SS_T/(N - 1)} \tag{4}$$

where:

N = total number of observations taken.
J = number of data arrays.


Glass and Hakstian (1969) pointed out that epsilon squared is actually 
not an unbiased estimate of the population correlation ratio since the 
ratio of unbiased estimators is not generally an unbiased estimate of 
the ratio itself (see Olkin and Pratt, [1958]). 

Hays (1963) derived still another measure of the strength of
relationship between a categorical variable X and the dependent
variable Y. This index was called $\omega^2$ (omega squared) and was
defined for the population as:

$$\omega^2 = \frac{\sigma_Y^2 - \sigma_{Y|X}^2}{\sigma_Y^2} \tag{5}$$

where:

$\sigma_Y^2$ = marginal variance of Y.
$\sigma_{Y|X}^2$ = conditional variance of Y given any X.

This is equivalent to the definition of the correlation ratio defined for a
population. According to Hays (1963) a "rough" estimate of the
population correlation ratio is given by:

$$\hat{\omega}^2 = \frac{SS_B - (J - 1)MS_W}{SS_T + MS_W} \tag{6}$$

where:

$SS_T$ = sum of squares total.
$SS_B$ = sum of squares between groups.
$MS_W$ = mean squares within groups.
J = number of groups.


Glass and Hakstian (1969) dealt with the relationship between $\epsilon^2$
and $\hat{\omega}^2$ and showed that epsilon squared could be written as:

$$\epsilon^2 = \frac{SS_B - (J - 1)MS_W}{SS_T} \tag{7}$$


The difference between Formulas 6 and 7 can be seen in the
denominators. Glass and Hakstian suggested that this difference is due
to the varying definitions of $\sigma_Y^2$ that Hays and Kelley employ. We
contend that the definition of $\sigma_Y^2$ is not equivocal, and that
Equations 6 and 7 differ due to the different estimates of $\sigma_Y^2$
used by Hays and Kelley. Glass and Hakstian pointed out that
$E[SS_T/(N - 1)] = \sigma_e^2 + \sum n_j \alpha_j^2/(N - 1)$ where N, $n_j$
are sample values and $\alpha_j$ is the effect of treatment j. Hays defined
$\sigma_Y^2 = \sigma_e^2 + \sum n_j \alpha_j^2/N$ in a single factor fixed
effect design but under the restriction that the relative sample sizes in
the various treatment groups be equal to the probability of randomly
observing a case from the corresponding population. Consequently
Hays’ definition of $\sigma_Y^2$ is not dependent upon sample size as he
requires $n_j/N$ to be fixed. However $E[SS_T/(N - 1)]$ does depend on
sample size as $n_j/(N - 1)$ will vary with different sample sizes even when
the proportional representativeness of the samples equals its
population counterpart. One obviously cannot define a population parameter
whose value is dependent upon the sample size used to estimate it.
Therefore, $\sigma_Y^2$ cannot be appropriately defined as
$\sigma_e^2 + \sum n_j \alpha_j^2/(N - 1)$. Hays' definition is the only one
appropriate and consequently $SS_T/(N - 1)$ is a biased estimate of
$\sigma_Y^2$ within the context of a one-way fixed effects model. Hays
developed his definition of $\sigma_Y^2$ by assuming the following fixed
effect model:

$$Y_{ij} = \mu + \alpha_j + e_{ij} \tag{8}$$

where $\sum n_j \alpha_j = 0$ with the $e_{ij}$ independently and normally
distributed with expectation zero and variance $\sigma_e^2$. By definition
$\sigma_Y^2 = E(Y_{ij} - \mu)^2 = E(\alpha_j + e_{ij})^2 = E(\alpha_j^2) + E(e_{ij}^2) = \sigma_e^2 + E(\alpha_j^2)$.
The term $\alpha_j$ is a discrete random variable with probability $n_j/N$
of taking on the value $\alpha_j$ when the sample proportions are
representative of the population so $E(\alpha_j^2)$ would be given by
$\sum n_j \alpha_j^2/N$ resulting in
$\sigma_Y^2 = \sigma_e^2 + \sum n_j \alpha_j^2/N$.

Glass and Hakstian showed that the two estimates, $\epsilon^2$ and
$\hat{\omega}^2$, of the population relationship would be essentially the
same in practice, since their relationship can be written as:

$$\epsilon^2 = \hat{\omega}^2 + \frac{\hat{\omega}^2 \, MS_W}{SS_T} \tag{9}$$

It can be readily seen that as the sample size increases or as the error
variance decreases, the estimates converge.
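
The sample statistics in Formulas 1, 6, and 7, and the identity in Formula 9, can be checked numerically. The sketch below is illustrative only: the function names are hypothetical and the three groups of scores are invented for the example, not taken from the study.

```python
def one_way_sums(groups):
    """Return (SS_B, SS_W, SS_T, N, J) for a list of groups of scores."""
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_w = sum(sum((y - m) ** 2 for y in g) for g, m in zip(groups, means))
    return ss_b, ss_w, ss_b + ss_w, n_total, len(groups)


def strength_estimates(groups):
    """Sample eta^2 (Formula 1), epsilon^2 (Formula 7), omega-hat^2 (Formula 6)."""
    ss_b, ss_w, ss_t, n, j = one_way_sums(groups)
    ms_w = ss_w / (n - j)
    eta2 = ss_b / ss_t
    eps2 = (ss_b - (j - 1) * ms_w) / ss_t
    om2 = (ss_b - (j - 1) * ms_w) / (ss_t + ms_w)
    return eta2, eps2, om2, ms_w, ss_t


# Hypothetical scores for three treatment groups (not data from the article).
data = [[3, 5, 4, 6], [6, 7, 9, 8], [9, 8, 11, 10]]
eta2, eps2, om2, ms_w, ss_t = strength_estimates(data)
print(eta2, eps2, om2)

# Formula 9: epsilon^2 = omega-hat^2 + omega-hat^2 * MS_W / SS_T
assert abs(eps2 - (om2 + om2 * ms_w / ss_t)) < 1e-12
```

Because the two estimates share a numerator and differ only by $MS_W$ in the denominator, $\epsilon^2$ is always at least as large in absolute value as $\hat{\omega}^2$, and the gap shrinks as $MS_W/SS_T$ becomes small.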

The popularity of Hays’ $\hat{\omega}^2$ as opposed to the lack of emphasis
given to Kelley's $\epsilon^2$ over the years in psychological journals is
difficult to explain. Glass and Hakstian discussed some problems with the
interpretation of $\hat{\omega}^2$ (and by implication $\epsilon^2$), which
they cited as the probable reason for the lack of popularity of
$\epsilon^2$ over the years. Their main argument was that these measures
depend too much on the specific levels chosen for the independent
variable. They postulated that although research workers claim to be
dealing with a fixed effects model they are actually concerned with their
variables as molar constructs and not with the levels actually
administered in the experiment. It seems to the present authors that
although it is frequently true the fixed effects model is used in
situations not satisfying all requirements for such a model, there are
circumstances where the fixed effects model is appropriate. Yet, Glass
and Hakstian's warning should be heeded
that such measures can be very misleading when used to estimate the
strength of association for variables where not all levels of interest are
included in the design.

It was the purpose of this study to empirically investigate some of
the characteristics of the sampling distributions of $\epsilon^2$ and
$\hat{\omega}^2$ under a limited set of conditions. The procedure
consisted of taking random
samples from populations with known correlation ratios and compar- 
ing the estimates so obtained with the true population values. It is also 
hoped that the results will give additional impetus to the discussion of 
the appropriateness of measures of association in the context of fixed 
effects analysis of variance. 


Procedure 
Parameters Relevant to Generation of Treatment Populations 


Since the study was concerned with measures of association in the
fixed effects analysis of variance context, the first consideration was
the number of independent variables and the number of levels chosen
for such variables. It was decided to carry out all experiments within
the framework of a three treatment level one-way analysis of variance
design in order to keep matters simple and also because there appears
to be a direct generalization to higher order designs. Five levels of the
population correlation ratio were used: $\eta^2$ = .00, $\eta^2$ = .05,
$\eta^2$ = .15, $\eta^2$ = .40, and $\eta^2$ = .75.

Both Kelley and Hays made use of the assumption of equal within
treatment population variances in the development of their formulas
for $\epsilon^2$ and $\hat{\omega}^2$. The homogeneity of variance
assumption has been shown not to be of critical importance in the F test,
provided equal sample sizes are used (Box, 1954; Norton studies, cited in
Lindquist, 1953). In empirical research this assumption is often
violated, and it is therefore desirable that the effects of heterogeneous
variances on the measures used to estimate the strength of relationship
be studied. Three levels of heterogeneity of variances were used: zero
heterogeneity (homogeneous variances); slight heterogeneity (ratios
3:2:1 from largest to smallest variance); and marked heterogeneity
(ratios 10:4:1 from largest to smallest variance). Since experimental
treatments generally affect both means and variances such that larger
means are usually associated with larger variances (Norton studies,
cited in Lindquist, 1953), heterogeneous variances were created
accordingly.

Table 1 shows for the fixed values of the treatment means the within
treatment error variances giving the desired population $\eta^2$ values. All
treatment populations were normally distributed with the means and
variances as shown in Table 1.
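
A rough sketch of one cell of such a sampling experiment is given below. It rests on several assumptions that are not taken from the article: the treatment means (-1, 0, 1) are hypothetical stand-ins for the Table 1 values, only the homogeneous-variance, equal-n case is shown, and the error variance is solved from the population definition of $\eta^2$ with equal group proportions. The function names are illustrative, and `eta2` must be greater than zero in this sketch.

```python
import random
import statistics


def estimates(groups):
    """Kelley's epsilon^2 and Hays' omega-hat^2 for one simulated experiment."""
    n = sum(len(g) for g in groups)
    j = len(groups)
    grand = sum(map(sum, groups)) / n
    means = [statistics.fmean(g) for g in groups]
    ss_b = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ss_w = sum(sum((y - m) ** 2 for y in g) for g, m in zip(groups, means))
    ms_w = ss_w / (n - j)
    ss_t = ss_b + ss_w
    numerator = ss_b - (j - 1) * ms_w
    return numerator / ss_t, numerator / (ss_t + ms_w)


def simulate(eta2=0.40, n_per_group=30, reps=1000, seed=1):
    """Repeat the experiment `reps` times for one population eta^2 (> 0)."""
    rng = random.Random(seed)
    means = [-1.0, 0.0, 1.0]                     # hypothetical treatment means
    var_between = sum(m * m for m in means) / 3  # sum of p_j * alpha_j^2, equal p_j
    sd_e = (var_between * (1 - eta2) / eta2) ** 0.5  # common error SD
    results = []
    for _ in range(reps):
        groups = [[rng.gauss(m, sd_e) for _ in range(n_per_group)] for m in means]
        results.append(estimates(groups))
    return results


runs = simulate()
eps = [e for e, _ in runs]
om = [o for _, o in runs]
print(statistics.fmean(eps), statistics.stdev(eps))  # mean and SD of epsilon^2
print(statistics.fmean(om), statistics.stdev(om))    # mean and SD of omega-hat^2
```

The standard deviations printed here play the role of empirical standard errors; rerunning with a smaller `n_per_group` gives a feel for how quickly the estimates become unstable in small samples.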


Parameters Relevant to the Sampling Experiments 

A series of experiments was simulated by drawing random samples
from the treatment populations specified above. Total sample size was
varied at three levels: N = 15, N = 30, and N = 90. Box (1954) showed
that the violation of homogeneity of variance assumption had little
effect on the probability of the F statistic exceeding the 5% significance
point, provided equal sample sizes per treatment condition were used.
However, unequal sample sizes did seriously affect the robustness of
the F statistic to violations of homogeneity of variance. A similar effect
should be found for the estimates of strength of association,
$\epsilon^2$ and $\hat{\omega}^2$, since they are directly related to F. It
can be easily shown that:

$$\epsilon^2 = \frac{F - 1}{F + \dfrac{N - J}{J - 1}} \tag{10}$$

where

N = total sample size.
J = number of treatment levels.

and

$$\hat{\omega}^2 = \frac{F - 1}{F + \dfrac{N - J + 1}{J - 1}} \tag{11}$$
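
The step from Formulas 6 and 7 to expressions in F is short. The following derivation is supplied here for convenience and uses only the usual one-way ANOVA identities; it is not quoted from the article.

```latex
\begin{aligned}
F &= \frac{MS_B}{MS_W}, \qquad SS_B = (J-1)\,F\,MS_W, \qquad
     SS_W = (N-J)\,MS_W, \qquad SS_T = SS_B + SS_W, \\[4pt]
\epsilon^2 &= \frac{SS_B - (J-1)MS_W}{SS_T}
            = \frac{(J-1)(F-1)}{(J-1)F + (N-J)}
            = \frac{F-1}{F + \dfrac{N-J}{J-1}}, \\[4pt]
\hat{\omega}^2 &= \frac{SS_B - (J-1)MS_W}{SS_T + MS_W}
            = \frac{(J-1)(F-1)}{(J-1)F + (N-J) + 1}
            = \frac{F-1}{F + \dfrac{N-J+1}{J-1}} .
\end{aligned}
```

Both estimates are therefore zero when F = 1 and negative when F < 1.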
The results of Box were based on the probability of an F exceeding
some point, while the present study is concerned more with the influence
on the specific magnitude of the F statistic and, consequently, ε² and ω̂².
The Norton study did look at the effect of heterogeneous variances on
the magnitude of the F statistic, but only for the case of equal sample
sizes. In the present study, for each level of total sample size, the within
treatment sample sizes were either equal or in the ratio 3:5:7. For the
heterogeneous variance conditions, two unequal sample size conditions
were used, so that the largest sample size was associated with the
largest within treatment variance in one condition, and with the
smallest within treatment variance in the other condition (i.e.,
the ratio was varied from 3:5:7 to 7:5:3).
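The sampling conditions described above can be sketched as a small Monte Carlo routine. This is only an illustration under stated assumptions, not the authors' program: the population means and variances below are placeholders (the Table 1 values are not reproduced here), the function names are hypothetical, and only the Python standard library is used.

```python
import random


def anova_estimates(groups):
    """Return (F, epsilon_sq, omega_sq) for a one-way layout given as lists."""
    n_total = sum(len(g) for g in groups)
    n_levels = len(groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    ms_within = ss_within / (n_total - n_levels)
    ss_total = ss_between + ss_within
    f_ratio = (ss_between / (n_levels - 1)) / ms_within
    numerator = ss_between - (n_levels - 1) * ms_within
    return f_ratio, numerator / ss_total, numerator / (ss_total + ms_within)


def simulate_cell(means, variances, sizes, reps=1000, seed=0):
    """Mean epsilon_sq and omega_sq over `reps` simulated experiments."""
    rng = random.Random(seed)
    eps_sum = omega_sum = 0.0
    for _ in range(reps):
        groups = [[rng.gauss(mu, var ** 0.5) for _ in range(n)]
                  for mu, var, n in zip(means, variances, sizes)]
        _, eps_sq, omega_sq = anova_estimates(groups)
        eps_sum += eps_sq
        omega_sum += omega_sq
    return eps_sum / reps, omega_sum / reps


if __name__ == "__main__":
    # Placeholder populations (not the Table 1 values): variances 10:4:1,
    # larger means paired with larger variances, N = 30 split 7:5:3 so the
    # largest n goes with the largest variance.
    print(simulate_cell([2.0, 1.0, 0.0], [10.0, 4.0, 1.0], [14, 10, 6]))
    # Same populations with the largest n paired with the smallest variance.
    print(simulate_cell([2.0, 1.0, 0.0], [10.0, 4.0, 1.0], [6, 10, 14]))
```

Comparing the two printed pairs gives a rough feel for how the pairing of sample size with variance shifts the mean estimates in one cell of such a design.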

In all there were five levels of population η², three levels of
heterogeneity of variance, three levels of total sample size, and three
levels of equal-unequal sample sizes. Within each cell 1000 simulated
experiments were carried out, with F, ε², and ω̂² being computed for
each.


In his derivation of ε², Kelley states that no systematic error will be
introduced if the numbers in the arrays for a sample give proportions
which differ in only a random manner from the population proportions.
However, a word of justification for estimating ω² by use of Formula 6
for the unequal sample size conditions seems warranted, since
this formula was developed for the situation where “. . . the
proportional representation of cases in the J samples is the same as the
proportions in the respective populations . . .” (Hays, 1963, p. 382).
Hays did not suggest an alternative estimate of ω² when the
proportional representativeness of the samples fails to match its
population counterpart. Vaughan and Corballis (1969), although strongly
urging the use of equal sample sizes, did offer a “fair approximation”
appropriate provided samples were approximately equal. They
developed their argument by noting that in a one-way fixed effects
model, E(MS_B) = σ_e² + (n Σ α_j²)/(J - 1), where n is the number of
observations per cell. In order to estimate the variance between
groups, σ_B² = Σ α_j²/J, they used:


$$\hat{\sigma}_B^2 = (J - 1)(MS_B - MS_W)/nJ. \qquad (12)$$


But for the unequal sample size case, E(MS_B) = σ_e² + Σ n_j α_j²/(J -
1), so one cannot simply divide by n as in Equation 12 to obtain an
unbiased estimate of σ_B². Vaughan and Corballis instead suggested
dividing by the mean n_j (n̄). The ω̂² then becomes:


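$$\hat{\omega}^2 = \frac{(J - 1)(MS_B - MS_W)/\bar{n}J}{[(J - 1)(MS_B - MS_W) + \bar{n}J\,MS_W]/\bar{n}J}$$

$$= \frac{SS_B - (J - 1)MS_W}{SS_B + (\bar{n} - 1)J\,MS_W + MS_W} = \frac{SS_B - (J - 1)MS_W}{SS_T + MS_W}. \qquad (13)$$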


Thus, the estimate of ω² for the unequal sample size case reduces to the
same estimate, Equation 6, as for the case of proportional
representation.
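The reduction can be checked numerically. The sketch below, with hypothetical function names and made-up mean squares, computes ω̂² once from the between-groups variance estimate divided by the mean group size and once from the sums-of-squares form (SS_B - (J - 1)MS_W)/(SS_T + MS_W); it assumes ω̂² is formed as σ̂_B²/(σ̂_B² + MS_W).

```python
def omega_sq_mean_n(ms_between, ms_within, sizes):
    """Omega-squared using the mean group size in place of n."""
    n_levels = len(sizes)
    n_bar = sum(sizes) / n_levels
    var_between = (n_levels - 1) * (ms_between - ms_within) / (n_bar * n_levels)
    return var_between / (var_between + ms_within)


def omega_sq_sums_of_squares(ms_between, ms_within, sizes):
    """Omega-squared as (SS_B - (J - 1) MS_W) / (SS_T + MS_W)."""
    n_levels = len(sizes)
    n_total = sum(sizes)
    ss_between = (n_levels - 1) * ms_between
    ss_total = ss_between + (n_total - n_levels) * ms_within
    return (ss_between - (n_levels - 1) * ms_within) / (ss_total + ms_within)


if __name__ == "__main__":
    # Made-up mean squares with group sizes in the ratio 3:5:7.
    print(omega_sq_mean_n(12.0, 4.0, [3, 5, 7]))
    print(omega_sq_sums_of_squares(12.0, 4.0, [3, 5, 7]))
```

The two functions agree for any group sizes because the mean group size times J is simply the total N.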


Results 


Means and standard deviations of the obtained ω̂² and ε² were
computed for each of the 120 cells. In Table 2 the means are given as the
upper numbers with the standard deviations given just below them.
The means indicate that ω̂² is slightly negatively biased (Z = -4.31, p
< .001, using data from equal sample size and homogeneous variance
conditions). Any bias in ε² is not evident from the data of Table 2. The
combination of homogeneous variances and unequal n yielded
consistent mean underestimates across all conditions of η² and total N.
With heterogeneous variances and unequal n, ω̂² and ε² were either
overestimates or underestimates depending on the particular
combination of unequal n with within cell error variance. When the
sample with largest n was taken from the treatment population with
largest variance, ε² and ω̂² substantially underestimated the population
relationship. However, for the reverse case (largest n from treatment
population with smallest variance) ω̂² and ε² substantially
overestimated the population association. It can also be seen that
heterogeneity of variance had little effect on the estimates under
conditions of equal n.

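Hays did not provide the standard error of ω̂², but Kelley did derive
an approximation of the standard error of ε², which he gave as:


$$\sigma_{\epsilon^2} = \frac{1 - \epsilon^2}{\sqrt{N - 1}} \left[ \frac{2(J - 1)}{N - J} + 4\epsilon^2 \right]^{1/2} \qquad (14)$$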


According to Kelley this standard error is appropriate unless 1/N is
not small in comparison with 1/√N. As with ε², the homogeneity of
variance assumption was used in deriving Formula 14. If population
η² is substituted for ε² in Formula 14, the resulting estimates for
the conditions of homogeneous variances and equal sample sizes are
in close agreement with the standard errors for ε² given in Table 2, but
are consistently slightly larger.

The data of Table 2 indicate that both ω̂² and ε² have large standard
deviations when small samples are used. For instance, when η² = .15 or
.40, the standard error of ω̂² was consistently close to .20 for N = 15,
close to .13 for N = 30, and still as large as .07 for N = 90. A
comparison between ω̂² and ε² favors ω̂² as a slightly more efficient
estimator than ε², since the standard deviations of ω̂² were consistently
somewhat lower than those of ε². Table 2 also shows that as total N
increased the standard deviations of the two indices decreased. Just as
the means of ω̂² and ε² were affected by the particular combination of
heterogeneity of variance and unequal n, the standard deviations were
similarly affected. In fact, within each condition of population η²,
there was a positive relationship between the means and the standard
deviations such that the largest mean was associated with the largest
standard deviation. A final observation regarding the standard
deviations is that the standard errors varied with the degree of the
population relationship. When the association in the population was very
strong (η² = .75) or very small (η² = .00) the standard deviations were
consistently lower than in any of the other conditions. The estimates
were least efficient when the population relationship was in the middle
range, since the conditions of η² = .15 and .40 had the overall largest
standard deviations. Using Formula 14 it can be shown that the maximum
standard error of ε² occurs at about η² = .28, η² = .31, and η² = .33 for
N = 15, N = 30, and N = 90, respectively, with standard errors of
approximately .232, .151, and .083.
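The location of these maxima can be explored with a short grid search. The sketch below assumes that Formula 14 has the form (1 - ε²)[2(J - 1)/(N - J) + 4ε²]^(1/2)/√(N - 1) with J = 3 treatment levels; that form is an assumption inferred from the maxima quoted above, not a transcription of Kelley's derivation, and the function names are hypothetical.

```python
import math


def se_epsilon_sq(eta_sq, n_total, n_levels):
    """Approximate standard error of epsilon-squared (assumed form of Formula 14)."""
    spread = 2.0 * (n_levels - 1) / (n_total - n_levels) + 4.0 * eta_sq
    return (1.0 - eta_sq) * math.sqrt(spread) / math.sqrt(n_total - 1)


def max_se(n_total, n_levels=3, step=0.001):
    """Grid-search the eta-squared in [0, 1] giving the largest standard error."""
    grid = [i * step for i in range(int(round(1.0 / step)) + 1)]
    best = max(grid, key=lambda eta_sq: se_epsilon_sq(eta_sq, n_total, n_levels))
    return best, se_epsilon_sq(best, n_total, n_levels)


if __name__ == "__main__":
    for n_total in (15, 30, 90):
        eta_sq, se = max_se(n_total)
        print(n_total, round(eta_sq, 2), round(se, 3))
```

Under the assumed form the search lands near η² = .28, .31, and .33 with standard errors near .232, .151, and .083, which is what makes the assumed form plausible.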

In addition to ε² and ω̂², F was computed for each simulated
experiment and tested for significance. The frequencies of significant F's
(not given in this paper) lend support to the previously reported negligible
effect of heterogeneity of variance on F under conditions of equal n
(Norton studies, cited in Lindquist, 1953). Furthermore, these results
confirm Box's (1954) derivations of the combined effects of unequal n
and heterogeneity of variance on the percentage of significant F's. The
number of significant F's was underestimated for heterogeneous
variances and unequal n when the largest variance was associated with
the largest n. However, for the reverse case (largest variance associated
with smallest n), the number of significant F's was overestimated. This
pattern is directly parallel to that reported for ε² and ω̂² in Table 2.


Discussion 


The purpose of this investigation was to study the sampling
distributions of ω̂² and ε², measures of strength of relationship, in fixed
effects analysis of variance designs. The results appear to promote some
caution on the part of investigators in the interpretation of ω̂² and ε²
with small samples. The standard errors of these two statistics appear to
be sizeable for small samples, as indicated by Table 2. However, it should
be pointed out that large standard deviations with small samples
are typical of other correlational indices.

Hays proposed that ω̂² would be useful for two major purposes. One is
to see if there is a strong relationship present even though a
nonsignificant F was obtained. The second is to test for a trivial
relationship even though a significant F was obtained due to use of
large samples. The results reported show that ω̂², and hence ε², have
only moderate utility for these purposes. By substitution into Equation
10 it can be shown that a nonsignificant F (p > .05) could yield an ω̂² as
large as .277 for N = 15. However, sizeable ω̂²s following
nonsignificant F tests with small N are quite likely to have come from
population associations of zero or close to zero, due to the large standard
errors with small samples. So even though one's best point estimate of the
degree of the population association is close to the obtained ω̂², the
large standard errors for ω̂² with small N reduce its utility for
detecting strong relationships following nonsignificant F tests.
However, using ω̂² or ε² to estimate the degree of association following
a significant F with large N does have strong merit. This increased
utility is due primarily to the reduced standard errors with large N.
The mean ω̂² for N = 90 following F tests significant at or beyond the p
< .001 point did vary across the population η² values. The utility of
computing ω̂² or ε² following significant F tests would of course
increase with increases in sample size.
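The bound quoted above for N = 15 can be checked by hand. Assuming J = 3 treatment levels (as implied by the three-part variance and sample size ratios) and a tabled 5% critical value of about F(2, 12) = 3.88, writing ω̂² in terms of F gives:

```latex
\hat{\omega}^2
  = \frac{F - 1}{F + \dfrac{N - J + 1}{J - 1}}
  = \frac{3.88 - 1}{3.88 + \dfrac{13}{2}}
  = \frac{2.88}{10.38}
  \approx .277
```

Any F below the critical value gives a smaller ω̂², so .277 is the largest estimate that can accompany a nonsignificant result under these assumptions.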

Summarizing some of the other results, it was found that ω̂² and ε²
were very similar, heterogeneity of variances had negligible effects on
the estimates under conditions of equal n, combinations of
heterogeneity of variance and unequal n yielded especially poor
estimates, and use of Formula 6 to estimate ω² with unequal n and
homogeneity of variance resulted in reasonable estimates, but did
increase the negative bias of the estimates.

One of the purposes of this study was to investigate the effects of
violating the assumptions under which ω̂² and ε² were defined, namely,
homogeneity of variance and proportional representativeness of one's
sample sizes. The formulas were not developed for such cases, but since
they probably will be used in such instances the robustness of these
two statistics under violation of these assumptions should be known.
We do not mean to imply that the statistic is not appropriate when the
conditions under which it was defined have been satisfied. For instance,
if one were to assume the sample sizes were proportionally
representative of the population sizes for the unequal sample size
conditions, this would alter the population η², and a little calculation
would show that the ω̂² were indeed better estimates of this population
value. However, this was not the purpose of our unequal n conditions, as
we wanted to investigate the effects on the estimators when the sample
sizes were not proportionally representative.


REFERENCES 


Box, G. E. P. Some theorems on quadratic forms applied in the study
of analysis of variance problems, I. Effect of inequality of
variance in the one-way classification. Annals of Mathematical
Statistics, 1954, 25, 290-302.

Glass, G. V., and Hakstian, A. R. Measures of association in
comparative experiments: Their development and interpretation.
American Educational Research Journal, 1969, 6, 403-414.

Hays, W. L. Statistics for psychologists. New York: Holt, Rinehart,
and Winston, 1963.

Kelley, T. L. An unbiased correlation ratio measure. Proceedings of
the National Academy of Sciences, 1935, 21, 554-559.

Lindquist, E. F. Design and analysis of experiments in psychology and
education. Boston: Houghton Mifflin Company, 1953.

Olkin, I. and Pratt, J. W. Unbiased estimation of certain correlation
coefficients. Annals of Mathematical Statistics, 1958, 29, 201-211.

Pearson, K. On the correction necessary for the correlation ratio η.
Biometrika, 1923, 14, 412-417.

Vaughan, G. M. and Corballis, M. C. Beyond tests of significance:
Estimating strength of effects in selected ANOVA designs.
Psychological Bulletin, 1969, 72, 204-213.



EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 555-566. 


OBTAINING PAIRED COMPARISONS DATA FROM 
MULTIPLE RANK ORDERS USING PARTIALLY 
BALANCED INCOMPLETE BLOCK DESIGNS¹


RALPH G. STRATON 
University of Sydney 


Complete paired comparisons data was obtained by the method of
multiple rank order (MRO) in the context of gathering rank order
preferences of grade six students, their parents, and their teachers for
instructional objectives. Partially balanced incomplete block designs
with two associate classes were used in the MRO instruments instead
of the usual balanced incomplete block designs.

The use of partially balanced designs may yield several benefits to
a researcher, including a reduction in the number of blocks of stimuli
to be ranked, a measure of the internal consistency of subject's
choices, and a concentration of experimental effort upon
comparisons of the most critical stimulus pairs. The benefits and the
associated costs of using these designs are discussed in the light of the
data obtained in the study.

It is recommended that whenever rank orders or paired comparisons
data is called for in a study, serious consideration be given to the
use of the MRO method. Furthermore, it is suggested that the overall
purposes of a study may best be served by the use of a partially
balanced rather than a balanced design in the MRO method.


In many educational and psychological research studies rank order
data is all that is required of individual subjects. The obtained rank
orders may hold intrinsic interest, or, if multiple measurements are
available, they may be used to determine interval scale values of the
stimuli using the law of comparative judgment (Guilford, 1954). In
these situations it is usual for either the method of rank order or the
method of paired comparisons to be used for data collection.
However, the method of multiple rank orders provides an alternative
method which is not often used but which has much to recommend it.
This paper will focus upon some of the features of the multiple rank
orders method and will report on its use in collecting rank order
preference data using partially balanced incomplete block designs with
two associate classes, instead of the more usual balanced incomplete
block designs.

¹ The author is indebted to Dr. Jack C. Merwin and Dr. James S.
Terwilliger for their helpful comments and suggestions.
Copyright © 1975 by Frederic Kuder


Method of Multiple Rank Orders 


The method of multiple rank orders was put forward by Gulliksen
and Tucker (1961) as a means of obtaining paired comparisons data
with a reduction in experimental labor. Their procedure followed an
earlier suggestion by Durbin (1951) for reducing the experimental
labor and biasing order effects of the method of rank order. Durbin
suggested obtaining rankings within subsets of the stimuli, or blocks,
rather than all together. For example, in Table 1, seven stimuli are
presented in seven blocks of three stimuli each.

In a sense, the three methods of rank order, multiple rank orders,
and paired comparisons form a continuum of ranking methods. In the
rank order method all “n” stimuli are ranked together in a single set or
block. Thus, if “b” is the number of blocks and “k” is the number of
stimuli within a block we have b = 1 and k = n. In paired comparisons
each stimulus is paired with every other stimulus to form b = n(n -
1)/2 blocks with the k = 2 stimuli being “ranked” within each block.
The multiple rank orders method lies between these extremes, having
1 < b < n(n - 1)/2 blocks with 2 < k < n stimuli within each block. The
rankings given to the stimuli within blocks may allow preferences
between the members of each stimulus pair to be deduced, thus
yielding paired comparisons data.


TABLE 1
Balanced Incomplete Block Design for Seven Stimuli

Block    Stimuli
  1      1  2  3
  2      1  4  5
  3      7  1  6
  4      4  2  6
  5      5  7  2
  6      3  4  7
  7      6  3  5


Allocation of Stimuli to Blocks


Data collection procedures which use the method of multiple rank 
orders differ not only in the total number of stimuli (n) but also in the 
number of blocks (b), the number of stimuli within each block (k), and
the conditions governing the allocation of stimuli to blocks. Together 
these factors make up the experimental design which is followed in al- 
locating stimuli to blocks. It will be apparent that the choice of design 
is critical in that it will determine what comparisons are made and 
hence the data yielded from the study and the inferences which may be 
made. 

In his original paper Durbin (1951) suggested that the stimuli be
allocated to blocks so as to conform to a balanced incomplete block
design (see Cochran and Cox, 1957). In making this suggestion he was
guided by two conditions: “(a) Each object should occur an equal
number of times in the experiment as a whole. (b) The number of times
two particular objects occur together in the same block should be the
same for all possible pairs of objects (p. 85).” These conditions also led
Gulliksen and Tucker (1961) to restrict themselves to balanced
incomplete block designs in suggesting the use of multiple rank orders
for collecting paired comparisons data. It will be apparent that the
arrangement of stimuli in Table 1 conforms to a balanced incomplete
block design.
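Durbin's two conditions are easy to check mechanically. The following sketch, with hypothetical function names, counts how often each stimulus occurs and how often each pair shares a block, using the seven blocks of Table 1 as input.

```python
from collections import Counter
from itertools import combinations

# Blocks of Table 1: seven stimuli in seven blocks of three.
TABLE_1_BLOCKS = [
    (1, 2, 3), (1, 4, 5), (7, 1, 6), (4, 2, 6),
    (5, 7, 2), (3, 4, 7), (6, 3, 5),
]


def occurrence_counts(blocks):
    """Count how often each stimulus appears across all blocks."""
    return Counter(stimulus for block in blocks for stimulus in block)


def pair_counts(blocks):
    """Count how often each unordered pair of stimuli shares a block."""
    return Counter(frozenset(pair)
                   for block in blocks for pair in combinations(block, 2))


def is_balanced(blocks):
    """True when both of Durbin's conditions hold for the design."""
    stimuli = sorted(occurrence_counts(blocks))
    all_pairs = [frozenset(p) for p in combinations(stimuli, 2)]
    pairs = pair_counts(blocks)
    equal_replication = len(set(occurrence_counts(blocks).values())) == 1
    equal_pairing = len({pairs.get(p, 0) for p in all_pairs}) == 1
    return equal_replication and equal_pairing


if __name__ == "__main__":
    print(is_balanced(TABLE_1_BLOCKS))
```

For Table 1 every stimulus occurs three times and every one of the 21 pairs shares a block exactly once, so both conditions hold.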

The constraints imposed by these conditions may be an advantage
or even a requirement in some studies. However, this is not always the
case and other designs may be more advantageous in certain
circumstances. Coombs (1964), in recognizing this fact, has remarked that:
“. . . there is nothing sacred about incomplete block designs for
collecting data being balanced—and an unbalanced or partially balanced
incomplete block design might be considered. . . . The fact that various
stimuli and combinations of stimuli would be presented a different
number of times (unbalanced) might even become a virtue if more
information is needed on some pairs than others, as indeed is usually the
case (p. 346).” Bock and Jones (1968) and Dykstra (1960) have
discussed the analysis of certain partially balanced designs.


Attributes of the Multiple Rank Orders Method 


A major advantage of the method of multiple rank orders is the
reduction in experimental labor which it affords. Gulliksen and
Tucker (1961) have estimated that “for twenty or thirty stimuli the
(multiple) rank order design takes, for each subject, only one-half to
one-fourth of the time required for the complete paired comparisons
(p. 174).” Furthermore, the complexity of the ranking tasks can be
kept low by putting restrictions on k, the number of stimuli in a block.

Multiple rank orders also allow departures from transitivity to be 
determined. For paired comparisons data departures from transitivity 
may be determined for any triad of the stimuli. With multiple rank 
orders, however, certain triads cannot be circular because the stimuli 
appear together in the same block, i.e., there is forced transitivity 
within blocks. Nevertheless, Kendall's (1955) “coefficient of con- 
sistence,” zeta, may still be used as an index of transitivity in many 
cases. His formulae will not be appropriate for all multiple rank orders 
designs, however, due to the constraints which these designs place 
upon the data. Gulliksen and Tucker (1961) assert that the formulae 
are always applicable when balanced incomplete block designs are 
used and Straton (1971) has demonstrated their applicability to certain 
partially balanced designs. 
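For reference, the standard coefficient of consistence for complete paired comparisons (each pair judged exactly once) can be sketched as follows. The function names are hypothetical, and, as noted above, these formulae are not appropriate for every multiple rank orders design; the sketch covers only the ordinary complete case.

```python
def circular_triads(wins):
    """Number of circular triads, given each stimulus's count of comparisons won."""
    n = len(wins)
    return n * (n - 1) * (2 * n - 1) / 12.0 - sum(a * a for a in wins) / 2.0


def coefficient_of_consistence(wins):
    """Kendall's zeta for a complete set of paired comparisons."""
    n = len(wins)
    d = circular_triads(wins)
    # The maximum number of circular triads differs for odd and even n.
    denominator = n ** 3 - n if n % 2 == 1 else n ** 3 - 4 * n
    return 1.0 - 24.0 * d / denominator


if __name__ == "__main__":
    # Hypothetical win counts for four stimuli: a fully transitive judge,
    # then a judge with the maximum number of circular triads.
    print(coefficient_of_consistence([3, 2, 1, 0]))
    print(coefficient_of_consistence([2, 2, 1, 1]))
```

A fully transitive set of choices gives zeta = 1, and the most circular pattern possible gives zeta = 0.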

The channel capacity of the multiple rank orders method, although 
less than that of paired comparisons, is considerably greater than that 
of the rank order method for many designs (Gulliksen and Tucker, 
1961; Coombs, 1964). According to Coombs “Channel capacity in- 
dicates how much information a method might carry and thereby 
provides a measure of relative power (1964, p. 34)." 


The Use of Partially Balanced 
Incomplete Block Designs 


In many studies, where the method of multiple rank orders may be
suitable, the use of a balanced incomplete block design will not be
appropriate. This may be due to the fact that there is no balanced design
available for the number of stimuli specified in the study. However,
another difficulty is that the number of blocks in a balanced design is
always at least as great as the number of stimuli. Thus, experimental
labor may be above tolerable limits. Both of these considerations led
to the choice of a partially balanced incomplete block design for use in
a study recently completed by the author (Straton, 1971).


Nature of the Study 


The study was concerned with the rank order preferences of grade
six students, their parents, and their teachers for science instructional
objectives. Objectives of two different levels of generality were used.
The characteristics of the subjects, the nature of the stimuli, and the
design of the study all necessitated that the two data collection
methods used should be simple, straightforward, and capable of being
completed quickly. The method of rank order was chosen as one
method. For the second method, both paired comparisons and
multiple rank orders using a balanced incomplete block design were ruled
out, as pilot study subjects found the methods to be too boring and too
time-consuming. Furthermore, no balanced incomplete block design
was available for use with fourteen stimuli, the number to be used in
the study.

A suitable design was located, however, in the extensive tables of 
partially balanced incomplete block designs with two associate classes 
given in Bose, Clatworthy and Shrikhande (1954). These designs 
violate the second of Durbin's (1951) conditions, i.e., some of the pos- 
sible pairs of stimuli appear together in the same block more often 
than other pairs. For 14 objectives there are 91 possible pairs. In the 
design chosen, seven of these pairs occurred together in the same block 
three times while all other possible pairs appeared only once (see Table 
2). However, each stimulus appeared only three times in the design as 
a whole, thus conforming to the first of Durbin's (1951) conditions. 

The chosen design had several important attributes. First, only 
seven blocks were required, with six stimuli in a block. A balanced 
design would have required at least twice as many blocks. Second, the 
seven replicated stimulus pairs occurred together with four different 
stimuli in each of three blocks. Thus, each pair was associated with 
every other stimulus once. Third, each stimulus was included in one 
and only one replicated pair. Fourth, these replicated stimulus pairs 
allowed an estimate to be made of the internal consistency of the 
responses of individual subjects. Many of the designs to be found in 
Bose, Clatworthy and Shrikhande (1954) have similar attributes. 
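The stated attributes of the chosen design can be verified by counting. In the sketch below the blocks follow the cyclic pattern read from Table 2 (block i holds stimuli i, i + 1, and i + 3, each with its partner seven places higher); where the printed table is hard to read, that pattern is an assumption. The function name is hypothetical.

```python
from collections import Counter
from itertools import combinations

# Blocks as read from Table 2 (14 stimuli, 7 blocks of 6); the cyclic
# pattern is an assumption where the printed table is hard to read.
TABLE_2_BLOCKS = [
    (1, 8, 2, 9, 4, 11), (2, 9, 3, 10, 5, 12), (3, 10, 4, 11, 6, 13),
    (4, 11, 5, 12, 7, 14), (5, 12, 6, 13, 1, 8), (6, 13, 7, 14, 2, 9),
    (7, 14, 1, 8, 3, 10),
]


def summarize_design(blocks):
    """Return replication counts and a tally of pair co-occurrence frequencies."""
    replications = Counter(s for block in blocks for s in block)
    pairs = Counter(frozenset(p) for block in blocks
                    for p in combinations(block, 2))
    n_stimuli = len(replications)
    n_possible_pairs = n_stimuli * (n_stimuli - 1) // 2
    frequency_tally = Counter(pairs.values())
    # Pairs that never share a block are recorded under frequency zero.
    frequency_tally[0] = n_possible_pairs - len(pairs)
    return replications, frequency_tally


if __name__ == "__main__":
    replications, tally = summarize_design(TABLE_2_BLOCKS)
    print(sorted(set(replications.values())), dict(tally))
```

With these blocks each of the 14 stimuli appears three times, seven of the 91 pairs share a block three times, and the remaining 84 pairs share a block once.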


Determining Ranks from Multiple Rank Orders Data 


Sets of rank orders, one per block, were the raw data yielded from
the multiple rank orders instruments. The implicit pairwise
preferences among the stimuli were deduced from these rank orders
and recorded in a “Matrix of Votes for Complete Data.” In this
matrix a one in a cell meant that the column stimulus was preferred
over the row stimulus. A zero meant that the row stimulus was
preferred. Thus, the sum of the entries in the (i, j)th and the (j, i)th cells
was equal to the number of times the stimuli appeared together in a
block in the whole design. For partially balanced designs this sum is
not necessarily unity, whereas for balanced designs it would be.


TABLE 2
Design Used for Multiple Rank Orders Instruments

Block    Stimuli
  1       1   8   2   9   4  11
  2       2   9   3  10   5  12
  3       3  10   4  11   6  13
  4       4  11   5  12   7  14
  5       5  12   6  13   1   8
  6       6  13   7  14   2   9
  7       7  14   1   8   3  10

The “Complete” matrix was converted to a “Matrix of Votes for
Paired Comparisons Data” by converting all cells to ones and zeros
(see Figure 1). This was done by finding the mean of the (i, j)th and the
(j, i)th cell entries and scoring one for the cell whose entry was greater
than this mean, and zero for the cell whose entry was less than this
mean. The final obtained rank order of the stimuli was based upon the
column vote totals of the “Paired Comparisons” matrix.
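The scoring steps just described can be sketched in a few functions. The names, the 1-based integer stimulus labels, and the tuple format for a block ranking (most preferred first) are assumptions made for illustration; pairs with equal entries, which cannot arise when a pair appears an odd number of times, are scored zero in both cells here.

```python
from itertools import combinations


def votes_matrix(rankings, n_stimuli):
    """Complete-data votes: entry [row][col] counts times col was preferred to row.

    `rankings` holds one tuple per block, listing stimuli (1-based labels)
    from most to least preferred.
    """
    votes = [[0] * n_stimuli for _ in range(n_stimuli)]
    for ranking in rankings:
        for preferred, other in combinations(ranking, 2):
            votes[other - 1][preferred - 1] += 1
    return votes


def paired_comparisons_matrix(votes):
    """Score 1 for the cell of each (i, j), (j, i) pair that exceeds their mean."""
    n = len(votes)
    binary = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                mean = (votes[i][j] + votes[j][i]) / 2.0
                binary[i][j] = 1 if votes[i][j] > mean else 0
    return binary


def final_rank_order(binary):
    """Order stimuli by column vote totals, highest total first."""
    n = len(binary)
    totals = [sum(binary[row][col] for row in range(n)) for col in range(n)]
    order = sorted(range(1, n + 1), key=lambda s: -totals[s - 1])
    return order, totals


if __name__ == "__main__":
    # Hypothetical subject who always prefers the lower-numbered stimulus,
    # responding to the seven blocks of a balanced design for seven stimuli.
    rankings = [(1, 2, 3), (1, 4, 5), (1, 6, 7), (2, 4, 6),
                (2, 5, 7), (3, 4, 7), (3, 5, 6)]
    votes = votes_matrix(rankings, 7)
    print(final_rank_order(paired_comparisons_matrix(votes)))
```

For a perfectly consistent subject the column totals simply count how many other stimuli each stimulus beats, so the final order reproduces the subject's underlying preference order.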


Test-Retest Reliability 


Tau coefficients were used as an index of agreement between rank
orders obtained from individual subjects. The mean tau coefficient was
used as an index of test-retest reliability using a one- to two-week
interval. Table 3 gives these mean tau values for the three rater groups
and two levels of objectives. The proportion of each group for which τ
≥ .50 is also shown. The probability that random responses would
yield a τ ≥ .50 is p < .012 according to the sampling distribution of tau
generated by Monte Carlo methods for this study.
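A sampling distribution of this kind can be approximated with a few lines of simulation. The sketch below is not the procedure used in the study, which is not described here; it assumes complete, untied rankings of 14 stimuli produced by random shuffling, so its estimate need not match the bound quoted above. The function names are hypothetical.

```python
import random
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall's tau for two untied rankings of the same stimuli."""
    concordant_minus_discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        concordant_minus_discordant += 1 if product > 0 else -1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return concordant_minus_discordant / n_pairs


def prob_tau_at_least(threshold, n_stimuli=14, reps=5000, seed=0):
    """Monte Carlo estimate of P(tau >= threshold) for random rank orders."""
    rng = random.Random(seed)
    base = list(range(n_stimuli))
    hits = 0
    for _ in range(reps):
        shuffled = base[:]
        rng.shuffle(shuffled)
        if kendall_tau(base, shuffled) >= threshold:
            hits += 1
    return hits / reps


if __name__ == "__main__":
    print(prob_tau_at_least(0.50))
```

Rank orders derived from multiple rank orders data can contain tied vote totals, which is one reason a study-specific simulation may give a somewhat different tail probability.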


Internal Consistency Reliability 


The internal consistency reliability of a subject's preference for one
of a pair of stimuli may be estimated from those pairs which are
presented to the subject more than once. It was assumed that the
subject had a “true” preference for one of each pair of stimuli and that
this preference did not change during the course of the administration
of the instrument. Thus, each time he failed to choose the preferred
stimulus he made an error and his response showed unreliability. The
more frequently chosen stimulus of a pair was considered to be the
truly preferred stimulus. Each choice was scored one or zero (choice
score) depending on whether the truly preferred stimulus was chosen
on that replication.

TABLE 3
Test-Retest Reliability: Mean Tau Values for Rank
Order and Multiple Rank Orders Data

                  Rank Order                Multiple Rank Orders
               Tau        Proportion        Tau        Proportion
Group       Mean   S.D.    τ ≥ .50       Mean   S.D.    τ ≥ .50
Level I
  Student   .474   .264     .611         .610   .163     .833
  Teacher   .654   .287     .611         .790   .182     .889
  Parent    .476   .285     .667         .564   .199     .556
Level II
  Student   .549   .304     .778         .600   .214     .778
  Teacher   .639   .267     .833         .628   .264     .833
  Parent    .549   .219     .667         .655   .268     .722

Note. N = 18 for each cell.


An index of internal consistency reliability, gamma (γ), was defined
as one minus the ratio of the observed error variance for a subject to
the maximum possible error variance, i.e.:

$$\gamma = 1 - \frac{\text{Error Variance}}{\text{Error Variance (Max.)}}$$

For each replicated stimulus pair the variance of the choice scores was
obtained and then these variances were summed across pairs to yield
the Error Variance term. The Error Variance (Max.) term was
obtained in a similar manner except that the choice scores used were
those which would result from the most inconsistent response pattern
possible. For three replications this would mean two ones and one
zero and for five replications three ones and two zeros.

Gamma values were calculated so as to yield the reliability of a
single choice between stimulus pairs for a single subject. The reliability
of the total votes for a single stimulus (see Figure 1), upon which the
final rank order depends, can be determined by applying the
Spearman-Brown formula to the gamma values. This index was called
gamma prime (γ′). In the present study 13 choices contributed to each
vote total and so the factor 13 was used in the Spearman-Brown
formula to obtain gamma prime values.

Table 4 gives the mean pretest values of gamma prime. The
proportion of each group for which γ′ ≥ .915 is also shown. The probability
that random responses would yield a gamma prime value of .915 or
more is p < .014 according to the sampling distribution generated by
Monte Carlo methods for this study.
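The gamma and gamma prime indices can be sketched as follows. The function names are hypothetical; the sketch uses the population form of the variance (the choice of divisor cancels in the ratio when every replicated pair has the same number of replications) and assumes an odd number of replications per pair, as in the examples given in the text.

```python
def choice_variance(scores):
    """Population variance of a list of 0/1 choice scores."""
    mean = sum(scores) / len(scores)
    return sum((s - mean) ** 2 for s in scores) / len(scores)


def max_choice_variance(n_replications):
    """Variance of the most inconsistent pattern, e.g. two ones and one zero."""
    ones = n_replications // 2 + 1
    return choice_variance([1] * ones + [0] * (n_replications - ones))


def gamma(choice_scores_by_pair):
    """One minus the ratio of observed to maximum possible error variance."""
    observed = sum(choice_variance(s) for s in choice_scores_by_pair)
    maximum = sum(max_choice_variance(len(s)) for s in choice_scores_by_pair)
    return 1.0 - observed / maximum


def gamma_prime(gamma_value, n_choices=13):
    """Spearman-Brown step-up of gamma to a total based on n_choices choices."""
    return n_choices * gamma_value / (1.0 + (n_choices - 1) * gamma_value)


if __name__ == "__main__":
    # Hypothetical subject: consistent on six replicated pairs, one lapse
    # on the seventh (three replications per pair).
    scores = [[1, 1, 1]] * 6 + [[1, 1, 0]]
    g = gamma(scores)
    print(g, gamma_prime(g))
```

With three replications a pair is either fully consistent or maximally inconsistent, so gamma for a subject reduces to the proportion of replicated pairs answered consistently.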


Relationship between Zeta and Gamma 


Both the index of transitivity, Kendall's (1955) zeta, and the index of
internal consistency reliability, gamma, may be viewed as indices of
internal consistency. In this situation one is forced to consider to what
extent these indices are related, and this was estimated using the
Pearson product moment correlation coefficient, r (see Table 5).


TABLE 4
Internal Consistency Reliability: Mean Gamma Prime Values
for Multiple Rank Orders Pretest Data

               Gamma Prime      Proportion
Group        Mean     S.D.      γ′ ≥ .915
Level I
  Student    .872     .102        .500
  Teacher    .966     .048        .889
  Parent     .906     .084        .639
Level II
  Student    .935     .051        .806
  Teacher    .976     .035        .972
  Parent     .943     .058        .861

Note. N = 36 for each cell.


In spite of the statistical significance of the correlation coefficients
(using α = .05), in no case is more than 40% of the variance accounted
for. Thus, it seems to be worthwhile to attempt to estimate both types 
of inconsistency. This is not possible using balanced designs unless the 
design is replicated, resulting in greater experimental labor. 


Agreement Across Methods 


The index of agreement used was the mean of the tau coefficients
calculated between rank orders obtained by the two methods, rank
order and multiple rank orders, for each subject. These mean taus
were calculated for the pretest data and are presented in Table 6
together with the proportion of taus within each subject group for
which τ ≥ .50. The probability that random responses would yield a τ
≥ .50 is p < .005 according to the sampling distribution of tau
generated by Monte Carlo methods for this study.

Tau values were also calculated across both methods and testing
sessions, i.e., allowing two, not just one, main sources of error to enter
the situation (see Table 6). There seems to be little doubt that the two
methods obtained essentially the same rank orders for most subjects.


TABLE 5
Correlation between Zeta and Gamma Values for
Multiple Rank Orders Pretest Data

Group        Pearson r       p-value
Level I
  Student    .630            p < .01
  Teacher    [illegible]     [illegible]
  Parent     [illegible]     [illegible]
Level II
  Student    .395            p < .05
  Teacher    .367            .02 < p < [illegible]
  Parent     .518            p < .01

Note. The p-values are associated with a two-tailed test of the
significance of the difference of r from zero.


Conclusions 


When only rank order data is required from individual subjects the
method of multiple rank orders has much to recommend it. It can
allow great reductions in time and in experimental labor compared with
the method of paired comparisons. It can also reduce the complexity
of a subject's task, compared with the method of rank order,
particularly when the number of stimuli becomes large. For some types of
subjects or for some types of stimuli this might mean as few as 14
stimuli. Since a wide range of possible designs for data collection can
be used with the method and since indices of transitivity and internal
consistency are available for many of these designs, the method of
multiple rank orders would appear to be the preferred method of data
collection in many situations.

The empirical data reported here, for three subject types and two
types of stimuli, is also favorable to the multiple rank orders method.
This method yielded rank orders in close agreement with those
obtained by the method of rank order, but with generally greater
test-retest reliability. Internal consistency reliability was also generally
high. Thus, it is surprising that only a few studies using the method of

TABLE 6
Agreement between Methods: Mean Tau Values for Pretest
Data and Reliability Data

                 Pretest Data               Reliability Data
               Tau        Proportion        Tau        Proportion
Group       Mean   S.D.    τ ≥ .50       Mean   S.D.    τ ≥ .50
Level I
Student 581 249 611 
: 3 860 5020 2298 4 
ДОШ р 1168 972 74 243 m 
NDS г 193 667 540 219 : 
Student 660 131 750 
` i .889 50 252 А 
Teacher 820 10 1.000 615 259 833 
Parent | 910 230 722 (568 251 72 
Моге. 


ТМ = 36 for each cell of the Pretest data and N = 18 for each cell of the Reliability data. 
multiple rank orders have been located in the educational and
psychological research literature. All except one of these studies used a
balanced incomplete block design (e.g., see Borgen and Weiss, 1968;
Terwilliger, 1963). Only McKeon (1960, 1961) reports having used a
partially balanced incomplete block design with two associate classes
as was used in the present study. However, in McKeon's design one of
the classes was null, whereas in the present study all stimulus pairs
appeared together in the same block at least once.
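
The distinction drawn above between balanced designs, designs with a null class, and designs in which every pair appears at least once can be checked by counting how often each pair of stimuli shares a block. The following is a rough sketch only: the function name and the seven-stimulus example are hypothetical, and the check looks at pair counts alone, not at block size, replication, or the full association scheme of a partially balanced design.

```python
from collections import Counter
from itertools import combinations


def pair_concurrences(n_stimuli, blocks):
    """Return the sorted distinct numbers of times stimulus pairs co-occur.

    A single value means every pair appears together equally often
    (as in a balanced incomplete block design); a 0 in the result means
    some pairs never share a block.
    """
    counts = Counter()
    for block in blocks:
        for pair in combinations(sorted(block), 2):
            counts[pair] += 1
    all_pairs = combinations(range(n_stimuli), 2)
    return sorted({counts.get(pair, 0) for pair in all_pairs})


# Hypothetical design: 7 stimuli in 7 blocks of 3, each pair together once.
balanced_blocks = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5),
                   (1, 4, 6), (2, 3, 6), (2, 4, 5)]
balanced_counts = pair_concurrences(7, balanced_blocks)

# Dropping one block leaves three pairs that never appear together.
reduced_counts = pair_concurrences(7, balanced_blocks[:-1])
```

In this sketch a result with two distinct nonzero values would correspond to the situation described for the present study, and a result containing 0 to the situation described for McKeon's design.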

There are, of course, costs involved in the use of the method of
multiple rank orders. Channel capacity is less than for the method of
paired comparisons and the transitivity of certain triads of the stimuli
cannot be determined. The use of a partially balanced or unbalanced
design carries the further risk of loss of experimental independence in
the replications, but this must be offset against the possibility of
concentrating experimental effort upon comparisons of the most
critical stimulus pairs.

It is recommended that, whenever rank order or paired comparisons
data are called for in a study, serious consideration be given to the
use of the method of multiple rank orders. Furthermore, it is suggested
that the overall purposes of a study may best be served by the use of a
design that is partially balanced or unbalanced, rather than a balanced
incomplete block design.
ESTIMATING MOMENTS OF UNIVERSE SCORES
AND ASSOCIATED STANDARD ERRORS IN
MULTIPLE MATRIX SAMPLING FOR ALL
ITEM-SCORING PROCEDURES


TEJ N. PANDEY AND DAVID M. SHOEMAKER
Southwest Regional Laboratory for Educational Research and Development


Described herein are formulas and computational procedures for 
estimating the mean and second through fourth central moments of 
universe scores through multiple matrix sampling. Additionally, 
procedures are given for approximating the standard error as- 
sociated with each estimate. All procedures are applicable when 
items are scored either dichotomously or polychotomously. 


MULTIPLE matrix sampling is a statistical procedure in which a set
of K items (referred to as the item universe) is subdivided into t
subtests containing k items each, with each subtest administered to n
examinees selected randomly from the population of N examinees.
Although each examinee tested is administered only a portion of the
K items, the results from each subtest may be used to estimate the
statistics of the universe scores which would have been obtained by
administering all K items to all N examinees. The advantages of
multiple matrix sampling over traditional testing procedures in the
estimation of group performance are numerous, have been cited
elsewhere (e.g., Osburn, 1967; Lord and Novick, 1968; Shoemaker, 1972),
and need not be enumerated here. Of primary concern is the fact that,
with few exceptions, computational formulas available currently in
multiple matrix sampling assume that individual test items are scored
dichotomously. A restriction such as this is relatively minor in the
area of achievement testing for the simple reason that here items are
typically scored dichotomously. However, if multiple matrix sampling
is to be applicable to other measurement instruments or to other
item-scoring procedures, more general equations must be made available.
Such has been our goal and the reader will find here computational 
formulas for estimating moments of universe scores which may be 
used when items are scored either dichotomously or polychotomously. 
Estimates of higher central moments of universe scores are required 
if multiple matrix sampling is used to approximate the entire frequency 
distribution of universe scores. Additionally, procedures for estimating 
the standard error of the pooled estimate of the mean universe score 
and pooled estimate of the second through fourth central moment 
of universe scores when items are scored polychotomously are given. 
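
To make the sampling scheme concrete, the sketch below draws one possible plan: t subtests of k items each, with each subtest assigned to n examinees. It is illustrative only. The function name is hypothetical, and the plan assumes that items and examinees are sampled without replacement across subtests (so tk may not exceed K and tn may not exceed N), which is one common plan but not the only one the procedures cover.

```python
import random


def matrix_sampling_plan(K, N, t, k, n, seed=0):
    """Draw t non-overlapping subtests of k items, each given to n examinees.

    Items are indexed 0..K-1 and examinees 0..N-1. Returns a list of
    (item_indices, examinee_indices) tuples, one per subtest.
    """
    if t * k > K or t * n > N:
        raise ValueError("plan needs t*k <= K and t*n <= N")
    rng = random.Random(seed)
    items = rng.sample(range(K), t * k)        # without replacement
    examinees = rng.sample(range(N), t * n)    # without replacement
    plan = []
    for s in range(t):
        subtest_items = sorted(items[s * k:(s + 1) * k])
        subtest_examinees = sorted(examinees[s * n:(s + 1) * n])
        plan.append((subtest_items, subtest_examinees))
    return plan


# Hypothetical plan: 12-item universe, 30 examinees, 3 subtests of 4 items.
plan = matrix_sampling_plan(K=12, N=30, t=3, k=4, n=5, seed=1)
```

Each examinee in such a plan answers only k of the K items, which is the source of the savings in testing time described above.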


Estimating Moments of Universe Scores 


The equations given here for estimating the moments of universe
scores have been derived within the mathematical framework devised
originally by Hooke (1954, 1956a, 1956b) and Lord (1960). Although
we report only final results here, a discussion of Hooke's methodology
and an expanded version of our analysis are available (Pandey and
Shoemaker, 1973).


Assumptions and Notation 


We define a population matrix $X = \|x_{IJ}\|$ for I = 1, 2, ..., N
examinees and J = 1, 2, ..., K items, where $x_{IJ}$ denotes the score
obtained by examinee I on item J. A matrix sample (bisample) taken
from the population matrix is denoted as $x = \|x_{ij}\|$ for i = 1, 2,
..., n examinees and j = 1, 2, ..., k items. We assume the n
examinees and k items in the bisample are a random sample from the
N-examinee population and the K-item universe. We assume that the
item-scoring procedure involves a continuous scale and is applied
uniformly to each item. Scoring items dichotomously or polychotomously
are special cases within our framework. We define an examinee's
universe score as the sum of his K item scores.


Parameters Estimated 


We seek unbiased estimators of the mean universe score and the
second through fourth central moments of universe scores using the data
from one matrix sample. Denoting the universe score for examinee I as
$x_{I+}$, we define the mean universe score as

$$\mu = \frac{1}{N} \sum_{I=1}^{N} x_{I+} \tag{1}$$

and the rth central moment of universe scores as

$$\mu_r = \frac{1}{N} \sum_{I=1}^{N} (x_{I+} - \mu)^r . \tag{2}$$
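
The two definitions above describe population quantities, that is, what would be computed if every examinee answered every item. The short sketch below evaluates them directly on a small, made-up population matrix; the function names and the data are illustrative assumptions, and this is not the estimation procedure of the paper, which works from a matrix sample.

```python
def universe_scores(population):
    """Universe score of each examinee: the sum of that row's K item scores."""
    return [sum(row) for row in population]


def mean_universe_score(population):
    """Mean of the universe scores over all N examinees (equation 1)."""
    scores = universe_scores(population)
    return sum(scores) / len(scores)


def central_moment(population, r):
    """rth central moment of the universe scores (equation 2)."""
    scores = universe_scores(population)
    mu = sum(scores) / len(scores)
    return sum((score - mu) ** r for score in scores) / len(scores)


# Hypothetical population: N = 3 examinees by K = 2 dichotomous items.
population = [[1, 1], [0, 0], [1, 0]]
mu = mean_universe_score(population)     # universe scores are 2, 0, 1
mu2 = central_moment(population, 2)
```

Because item scores enter only through row sums, the same functions apply unchanged to polychotomously scored items.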


Derived Equations 


Equations are given below for estimating the moments when both N
and K are finite. Because the population is frequently very large, we
give additionally equations for estimating the second through fourth
central moments when N is infinite. Following these equations, we
describe the procedure used in computing the D's, A's, and F's given in
our equations.


Mean Universe Score:

$$\hat{\mu} = \frac{K}{nk} \sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij} \tag{3}$$

Second Central Moment:
— 1 n 
y = QU Dp, 2) к (4) 
А = b». + в к (4а) 
Third Central Moment: 
A A 
Fourth Central Moment: 
— 1 Ел ү 2Ёв ү АЁв 1 Far 
№ = paci D URL EL E + ra 
Е N — JN? — 3N — 3) 2E 6 


SEU xA ES C ER 
А+ |] 


ù = bz. ++ LL + 6Fıs 425,4) 
Noo 
+ =, (6Е,; + 122 + ЗЕ + AFai) (62) 


К? 


F + (ЗЕ + kJ) к 
TABLE 1
Sigmas (Σ's) Associated with Table 2

Σ    Arithmetic Computation

1    $(x_{++})^2$

2    $\sum_{i=1}^{n} x_{i+}^2$

3    $\sum_{j=1}^{k} x_{+j}^2$

4    $\sum_{i=1}^{n} \sum_{j=1}^{k} x_{ij}^2$


Computational Procedures 


The procedure for computing the D's, A's, and F's is described here,
and it should be noted that doing so is not a casual undertaking on a
desk calculator. The reader should anticipate using an electronic
computer. Although our tables may seem cumbersome initially, they are in
a form which is computerized easily.

Three steps are required in calculating the D's, A's, and F's, and each
will be described in detail for calculating the D's. The same procedure is
used to compute the A's and F's. Given the matrix sample $x = \|x_{ij}\|$,
the following three steps are used to compute any D:


Step 1: Calculate the sums indicated in Table 1. The plus sign (+) 
given as a subscript denotes that the subscript replaced by 
the + is summed over all values. 

Step 2: Using the sums calculated from Table 1, calculate the d- 
statistics using the coefficients and constants given in Table 
2. For example, 


1 
а-а OD) + (X22 + 000 


where $k^{(i)} = k(k - 1)(k - 2) \cdots (k - i + 1)$.
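
The descending product defined above appears in the multiplicative constants of the conversion tables. A minimal sketch of how it might be computed is given below; the function name is hypothetical.

```python
def falling_factorial(k, i):
    """Return k(k - 1)(k - 2)...(k - i + 1), the product of i descending terms.

    By convention the empty product (i = 0) is 1.
    """
    if i < 0:
        raise ValueError("i must be non-negative")
    product = 1
    for step in range(i):
        product *= k - step
    return product


# Example: a constant of the form 1 / (n^(2) k^(2)) for n = 5 and k = 4.
constant = 1 / (falling_factorial(5, 2) * falling_factorial(4, 2))
```

Note that the product is zero whenever i exceeds k, so constants of this form are defined only when the subtest has at least i items and i examinees.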


TABLE 2 
Conversion of Sigmas (3's) to d-Statistics 
Multiplicative 2 
4 Constant 1 2 3 4 
1 Ти) 1 1 
- cad 
2 l/(n кз!) 1 0 =] 
2 l/(n"k у 1 =i 


Ма к) 1 
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TABLE 3 
Conversion Table for Computing D's from d's 


> 
N 
w 
ы 


Une 
| 


Step 3: Using the d-statistics from Table 2, calculate the D-statistics 
using the coefficients given in Table 3. For example, 


D, = (-1Ха) + (0)(4) + (1)(%). 
The same procedure is used to calculate the A's and F's. The A's are
calculated using Tables 4, 5, and 6; the F's, Tables 7, 8, and 9. The
summations in Tables 4 and 7 require two additional matrices Y and Z
where $Y = \|y_{ij}\| = \|x_{ij}^2\|$ and $Z = \|z_{ij}\| = \|x_{ij}^3\|$.


TABLE 4 
Sigmas (275) Associated with Table 5 


z Arithmetic Computation 
1 (н)? 
2 (нь) 


=1 


3 (5 хк.) 


rl 
^ 
4 Dx? 
т 
В 
5 Ух 
LI 
a А 


6 У Ххихехе 


DI 


7 pras) 


H 
8 Y» 


x 
9 X»aoxa 


ра 
mo^ 


10 922 Ух” 
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TABLE 5 
Conversion of Sigmas (X's) to a-Statistics 


t4 


Multiplicative 
Constant 1 2 3 4 5 6 à 8 9 


ШЫЛЫ) Е: 2 2 
ШЫСЫ) N 70. .- 
Мик) 1 0 -i 

1/(п Kk?) 1 0 
(так) 1 
1/(n?ki2!) 
1/(n'2k2) 
1/(л Кї!) 1 0 
(пк) 1 
l/(n k) 


Calculating Pooled Estimates of Parameters 


The estimators defined by equations (3) through (6) produce
estimates of parameters using the results of one subtest or matrix sample.
In multiple matrix sampling there are t subtests and, because of this, t
estimates of each parameter are obtained. Combining or pooling these
t estimates of each parameter into a single estimate is accomplished by

$$\hat{\theta}_{pooled} = \frac{\sum_{s=1}^{t} O_s \hat{\theta}_s}{\sum_{s=1}^{t} O_s} \tag{7}$$

where $\hat{\theta}_s$ is the estimate of the parameter obtained from subtest s
and $O_s = n_s k_s$, the number of observations acquired by subtest s.
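
As an illustration of pooling, the sketch below weights each subtest's estimate by its number of observations. It assumes the usual matrix-sampling estimate of the mean universe score for one subtest (K times the mean item score in the sample) and the observation-weighted average described above; the function names and the data are hypothetical.

```python
def subtest_mean_estimate(scores, K):
    """Estimate the mean universe score from one n-by-k matrix sample.

    Computed as K times the average item score in the sample.
    """
    n, k = len(scores), len(scores[0])
    total = sum(sum(row) for row in scores)
    return K * total / (n * k)


def pooled_estimate(estimates, observations):
    """Average of subtest estimates weighted by observations (n_s * k_s)."""
    weighted = sum(o * e for o, e in zip(observations, estimates))
    return weighted / sum(observations)


# Hypothetical data: a 6-item universe sampled by two subtests.
subtest_1 = [[1, 0], [1, 1]]   # n = 2 examinees, k = 2 items
subtest_2 = [[1, 1, 1]]        # n = 1 examinee,  k = 3 items
estimates = [subtest_mean_estimate(subtest_1, 6),
             subtest_mean_estimate(subtest_2, 6)]
pooled = pooled_estimate(estimates, observations=[4, 3])
```

The same pooling function could be applied to subtest estimates of the higher central moments, since only the per-subtest estimates and observation counts are needed.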


Calculating Standard Errors of Pooled Estimates 


Computing the pooled estimates of each parameter is solving only 
half the problem. Of equal importance is estimating the standard error 


TABLE 6 
Conversion Table for Computing A's from a's 
е ЕЕ vui mendi НН 


А 1 2 3 4 5 6 FERE мо 
с 6 7 8 9 — 

1 1 

2 -1 1 

3 0 1 

4 2. 3 0 1 

5 2 Oo - 0 1 

8 Liles a) Бы МЫ 0 0 1 

| и 0 0 0 1 

12 3 22-І ОЕ STÎ 
Ато) 2 3 m (ecc. 1 
I Ж аза ушга, 2 2 (amie ines Кас ie 
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TABLE 7 
Sigmas (Z's) Associated with Table 8 
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X Arithmetic Computation 
of estimate associated with this value. Two general procedures are
available for estimating the standard error of estimate. The first
procedure is one described originally by Lord and Novick (1968) and
uses results reported by Hooke (1954, 1956a, 1956b); the second,
referred to as the "jackknife" procedure, has been popularized by
Mosteller and Tukey (1968). Although both procedures may be used
when items are scored polychotomously, the first is applicable only for
those item-sampling plans having the product tk less than or equal to
K. The jackknife procedure may be used with all item-sampling plans.


The Hooke-Lord-Novick Equations 


If items and examinees have been sampled randomly, the squared 
standard error of estimate of the mean universe score for subtest s is 
equal to p (K — кур 

Р, 2 — Ke | 2 

+ = |- 26р ف‎ | ра, 8 

va on.) = |а жаа 9 

Equation (8) estimates the error variance associated with the estimate 

of the mean universe score obtained from one subtest. The estimate of 

the standard error of the pooled estimate of the mean, however, is a 
function of equation (8) and is computed as 


«= DE У ё, 
1 өті 


4 1/2 
SE (и) = 7 | VAR (5 - жо. iil 


where 4,2 refers to the estimate of the variance of the mean item 
scores (variance of the item difficulty indices) for subtest s. 

The squared standard error of estimate for the variance of universe 
Scores for subtest s is equal to 


2F, Е 4n, F, 2n, Fi; 
Vm с ттш t ke DEED -D 
4Fis 4Fn 2 
+ а. =) Ра EG. DOLE D 
2 Ехо 
Mrs z] x 


(10) 


The standard error of the pooled estimate of the variance is computed 

T ге О 1/2 

SE (а) = 1 (5 VAR (à,) — (t — 1)4K E VAR вы, (11) 

ee t Wat m 

where VAR (5».), refers to the variance of the covariances between 

each item and the mean item score for subtest s. It should be noted
that equations (9) and (11) were derived under the assumption that the
number of examinees in the population is infinitely large.

Equations comparable to (8) and (10) could be derived for the third
and fourth central moments within the framework developed by
Hooke; however, doing this would not be a casual undertaking. In
Hooke's procedure, estimating the standard error of the ith moment
requires terms related to the 2ith moment, and the number of sums,
i.e., Σ's in Tables 1, 4, and 7, becomes large. For example, for the 5th,
6th, 7th and 8th moments, 91, 298, 910, and 3017 sums, respectively,
must be computed. Were one to pursue this line of research, the work
of Dayhoff (1964, 1966) would be of use.


The Jackknife Procedure 


The jackknife procedure provides an alternative procedure for
computing standard errors of pooled estimates in multiple matrix
sampling. The jackknife procedure could be used to compute the standard
errors of the pooled estimates of the third and fourth central moments
after the results from each subtest had been collected. The
computations involved in the jackknife are relatively simple. Let


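$y_{all}$ = the pooled estimate of the parameter using all subtests, and
$y_{(j)}$ = the pooled estimate of the parameter computed after removing
subtest j.


Defining

$$y^*_j = t\,y_{all} - (t - 1)\,y_{(j)} \quad \text{for } j = 1, 2, \ldots, t,$$

the jackknifed estimate of the parameter is equal to

$$y^* = \frac{1}{t}\left(y^*_1 + y^*_2 + \cdots + y^*_t\right) \tag{12}$$

with an estimate of its variance given by

$$s^2_{y^*} = \frac{\sum_{j=1}^{t} y^{*2}_j - \frac{1}{t}\left(\sum_{j=1}^{t} y^*_j\right)^2}{t(t - 1)} . \tag{13}$$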

Shoemaker (1973) has verified empirically that the jackknife procedure
may be used to approximate standard errors of estimate in multiple
matrix sampling. It should be noted, however, that when the variance
of the item difficulty indices is greater than zero, the jackknife
procedure estimates conservatively the standard error of the mean
universe score.
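
The jackknife computations can be sketched in a few lines. The example below assumes the standard pseudo-value form of the jackknife (t times the all-subtest estimate minus t - 1 times the leave-one-out estimate) and assumes that pooling weights each subtest estimate by its number of observations; the function name and the data are hypothetical.

```python
def jackknife_pooled(estimates, observations):
    """Jackknifed pooled estimate and its standard error.

    estimates[s] is the parameter estimate from subtest s and
    observations[s] is its weight (number of observations).
    """
    t = len(estimates)
    if t < 2:
        raise ValueError("the jackknife needs at least two subtests")

    def pooled(indices):
        weight = sum(observations[s] for s in indices)
        return sum(observations[s] * estimates[s] for s in indices) / weight

    y_all = pooled(range(t))
    pseudo_values = []
    for j in range(t):
        y_without_j = pooled([s for s in range(t) if s != j])
        pseudo_values.append(t * y_all - (t - 1) * y_without_j)

    y_star = sum(pseudo_values) / t
    sum_sq = sum(value ** 2 for value in pseudo_values)
    variance = (sum_sq - sum(pseudo_values) ** 2 / t) / (t * (t - 1))
    return y_star, max(variance, 0.0) ** 0.5


# Hypothetical subtest estimates with equal numbers of observations.
y_star, standard_error = jackknife_pooled([1.0, 2.0, 3.0], [10, 10, 10])
```

Only pooled estimates are required, so the same routine could be applied to estimates of the third and fourth central moments, for which the Hooke-type formulas are not given.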


Conclusion 


If multiple matrix sampling is to be used more widely,
computational formulas which incorporate all uniform item-scoring
procedures must be available and in a form easy to compute. Our
results are a step in this direction. Although at first glance the tables
may seem cumbersome to use, they are in a form which lends itself
readily to being programmed on a computer.
THE STRUCTURE OF DOMAIN HIERARCHIES
FOUND WITHIN A DOMAIN REFERENCED
TESTING SYSTEM


GEORGE B. MACREADY 
University of Maryland 


The purpose of this study was to assess conditional states of item
mastery found among items from different item domains and the
effectiveness of various procedures for identifying such conditional
relations. The item domains considered were from the curriculum
area of multiplication of whole numbers, and were defined by a
domain referenced testing system. It was possible to infer from the
results of this study that the domain referenced testing system
considered produced items which across domains showed strong
conditional relations. Comparisons of goodness of fit were made among
domain hierarchies with similar numbers of specified conditional
relations generated by two different empirical procedures and by
experts' judgment. Additional comparisons were made among
models generated by the same procedure but with different numbers
of specified conditional relations.

Support for the validity of empirically generated hierarchies with
moderate numbers of conditional relations among domains was
provided. However, similar support was not provided for the
conditional relations hypothesized to exist by subject matter experts.


ONE kind of information that may prove to be useful to the educator
in making decisions about the implementation and improvement of an
instructional design deals with the nature of the underlying structure
of the subject matter (e.g., Resnick and Wang, 1969). This structure
deals with the order of acquisition found among specified intellectual
capabilities and is described as a "hierarchy" among these skills
(Gagne, 1962; Glaser and Nitko, 1971). The specifications for such
hierarchies may vary from one extreme called a linear ordered set of
variables, which imposes maximum restrictions regarding order, to the
other extreme called a completely independent set of variables, in
which no restrictions are imposed on the order of the variables.

In education today practitioners either explicitly or implicitly make
assumptions about the underlying structure of the subject matter.
However, they seldom go on to empirically verify or modify their
conceptions of the subject matter. Instead, much of the strengthening or
modifying of their original conceptions is based on an intuitively
guided process. Such procedures may lead to rigid, overly simplified,
and stilted conceptions of the subject matter.

Empirical study in this area is difficult, primarily because the
subject-matter areas are not seen as being readily accessible to such
analysis, owing to a lack of clarity regarding content. One means of dealing
with the lack of clarity of content found in many educational
achievement tests is to use a domain referenced approach to testing.
Following Hively's (1970) conceptualization, the universe of items which is to
be considered is specified in operational terms by means of rules called
item form rules. Specific sets of these rules are used to specify within
which of a number of possible subsets each item contained in the
universe falls. The subsets of items are called domains.

In the specification of the item form rules an attempt is made to 
place items within a given domain in such a way that each item within 
the domain is testing the same underlying skills. It is then possible to 
describe a set of logical relations among the various domains of items. 
Such a set of relations is called a domain hierarchy. 


Method 
Instrument Construction, Administration and Scoring 


The domain referenced test used in this study was the section of
Honeywell's "Arithmetic Test Generation Program" (ATG) dealing
with multiplication of whole numbers (see Patterson and Vierling,
1970).

The ATG program has grouped multiplication items into 20
nonoverlapping domains. The six domains used in the study were chosen
on the basis of pilot study results.

In the pilot study, a multiplication test was administered to a total
of 115 fifth grade students. The test was composed of one randomly
generated item from each of the 20 domains provided by the ATG
program. To obtain a more representative sample of the items within
each domain, three different forms of the pilot test composed of
different randomly generated items were each administered to one-third of
the students. In the selection of the six domains that were considered
in the main study, operationally defined selection rules were used such
that the domains selected had average item difficulties uniformly
distributed between .3 and .7. This was done to obtain a large amount of
subject variability on items within the domains (see Macready and
Merwin, 1973) and to provide an opportunity to identify a wide
assortment of hierarchies which might exist among domains. Descriptions
of the characteristic skills involved in working items from the six
selected domains are listed in Table 1.

For use in the main study, ten items were randomly sampled from
each of the six selected domains. The 60 items thus generated made up
the specific content of the test used. This test was administered to 285
students in 10 fifth-grade classrooms in the Minneapolis public
schools. The items on the test were then dichotomously scored as
either right or wrong.


Generation of Hierarchical Structures 


The generation of hypothetical hierarchies among the domains
considered in this study was carried out in an attempt to reflect ordered
relations in the acquisition of skills necessary to work the items from
the various domains. To generate such hierarchies, both "theoretical"
and "empirical" procedures were used.


First, a theoretically generated hierarchy was considered. This 


TABLE 1
Item Form Rules and Processes Found in the Domains Studied*

Domain    Characteristic Skill

10        2-digit multiplier; no carry
12        2-digit multiplier; multiple of 10
13        2-digit multiplier; easy carry
15        2-digit multiplier; hard carry
17        3-digit multiplier with middle digit equal to 0
18        3-digit multiplier with no digits equal to 0

* The item form rules used to define the items found within each domain are presented by Macready (1972).
hierarchy was based on mathematics teachers' professional judgment
as to what preliminary skills are necessary prerequisites for the
learning of more advanced skills. The specific hierarchy considered was
from the ATG manual (see Patterson and Vierling, 1970) and involved
the 20 item domains found in the area of "multiplication of whole
numbers." For the purposes of this study, only that portion of the
hierarchy which dealt with the domains used was considered. This
portion of the hierarchy is schematically represented in Figure 1. Such a
schematic representation of a hierarchy may be interpreted in the
following manner. The numbers to the right of each small circle are the
identification numbers for the domains which the small circles
represent. The lines connecting the small circles represent conditional
relations that hold between the domains whose representative circles are
connected. The domain represented by the lower of the two connected
circles is considered to be a necessary prerequisite to the domain
represented by the higher circle. Thus, the acquisition of skills
necessary for correctly working the items from a lower level domain is
considered necessary but not sufficient for the acquisition of the skills
necessary for correctly working the items from a connected higher
level domain. Another important characteristic of these schematic
structures is that the conditional relations which are represented are
transitive. Schematically, this means that any two circles which are
indirectly connected by a continuously increasing or decreasing set of
line segments are considered to have a conditional relation existing
between their domains, such that the higher of the two is considered
conditional on the acquisition of the lower. In the hierarchy
represented in Figure 1 this means that both domains 10 and 13 are
considered to be conditional prerequisites for the acquisition of domains
12, 15, and 18, while for domain 17, the conditional prerequisites are
considered to be domains 10, 12, and 13.
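
The transitivity rule described above can be illustrated by computing, for each domain, every direct and indirect prerequisite from the lines drawn in a diagram. In the sketch below the function name is hypothetical, and the list of direct connections is an assumption chosen only to be consistent with the prerequisites stated in the text; it is not a transcription of Figure 1.

```python
def all_prerequisites(direct_relations):
    """Map each domain to the set of its direct and indirect prerequisites.

    direct_relations is a list of (lower, higher) pairs, meaning the
    lower domain is a necessary prerequisite of the higher domain.
    """
    closure = {}
    for lower, higher in direct_relations:
        closure.setdefault(lower, set())
        closure.setdefault(higher, set()).add(lower)

    changed = True
    while changed:                      # repeat until nothing new is added
        changed = False
        for domain in closure:
            inherited = set()
            for prerequisite in closure[domain]:
                inherited |= closure[prerequisite]
            if not inherited <= closure[domain]:
                closure[domain] |= inherited
                changed = True
    return closure


# Assumed direct connections, consistent with the stated prerequisites.
assumed_lines = [(10, 13), (13, 12), (13, 15), (13, 18), (12, 17)]
prerequisites = all_prerequisites(assumed_lines)
```

Under these assumed connections domain 17 inherits domains 10 and 13 through domain 12, which matches the interpretation given in the text.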


Figure 1. Theoretically generated hierarchy based on mathematicians' professional judgment.
In addition to the theoretical hierarchy presented above, a number
of empirically generated hierarchies based on the pilot study data were
also considered. Two methods were used in generating these
hierarchies.

The first method was a slight modification of a procedure suggested
by Carroll and described by Resnick and Wang (1969). The procedure
determines the ordered relations to be considered on the basis of item
indices of homogeneity, $H_{ij}$, found between items. The statistic, $H_{ij}$,
estimates the proportion of possible increase in the probability of
getting item i correct, given that a more difficult item j is answered
correctly. This statistic is equivalent to Phi/Phi max, which was used by
Carroll.
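
One way to read the verbal definition above is as the gain in the probability of passing the easier item i, once the harder item j is known to be passed, expressed as a fraction of the largest gain possible. The sketch below implements that reading; the formula, the function name, and the response vectors are assumptions made for illustration and are not quoted from the study.

```python
def homogeneity_index(easier, harder):
    """H for an easier item i and a harder item j, from 0/1 responses.

    Computed as (P(i correct | j correct) - P(i correct)) divided by
    (1 - P(i correct)), the largest possible increase.
    """
    if len(easier) != len(harder):
        raise ValueError("both items need responses from the same students")
    p_easier = sum(easier) / len(easier)
    n_harder_correct = sum(harder)
    if n_harder_correct == 0 or p_easier == 1:
        raise ValueError("index is undefined for these item difficulties")
    both_correct = sum(1 for e, h in zip(easier, harder) if e and h)
    p_easier_given_harder = both_correct / n_harder_correct
    return (p_easier_given_harder - p_easier) / (1 - p_easier)


# Hypothetical responses of four students to two items.
h_perfect = homogeneity_index([1, 1, 1, 0], [1, 1, 0, 0])
h_unrelated = homogeneity_index([1, 1, 0, 0], [1, 0, 1, 0])
```

An index of 1 arises when no student passes the harder item while failing the easier one, which is the pattern a strict prerequisite relation would produce; an index of 0 arises when passing the harder item carries no information about the easier one.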

Since in the case of the pilot data only one item from each domain
was administered to each student, the procedure used consisted of
identifying the $H_{ij}$ indices for all pairs of items and then specifying the
conditional relations which were to be considered in the hierarchy on
the basis of the magnitude of these indices. This was done by taking
the K largest coefficients of $H_{ij}$ and specifying conditional relations
between the domains involved, such that the more difficult domain in
the pair was considered conditional on the less difficult domain. One
exception to the above procedure was that a conditional relation was
not considered unless it showed transitivity with the other conditional
relations being considered. This situation occurred only once, in the
case of the "$H_{ij}$" generated model based on 10 specified
conditional relations. To obtain transitivity in this model, the
nontransitive $H_{ij}$ value was replaced (i.e., the 10 largest transitive $H_{ij}$ values
were used for generating the model).

By using the above procedure and varying the minimum acceptable
value of $H_{ij}$ for inclusion of the conditional relations, six
hierarchies were generated. The cutting points for acceptable
magnitudes of $H_{ij}$ were chosen in such a way as to allow for
comparisons with the hierarchies with similar numbers of conditional
restrictions generated by the other procedures and to allow for
comparisons among various structures generated by this procedure with
varying numbers of conditional relations between the domains.
These hierarchies are schematically represented in Figures 2 through 7.

The second general method used to empirically generate domain
hierarchies from the pilot data was based on a method presented by
Bart and Krus (1973). In this procedure, the response patterns of
individual students were used in determining which particular
hierarchical relations among domains seem to best fit the data. This
was done by considering the item response patterns for each individual
on the item from each domain in some specified order. Those patterns
Figure 2. Generated hierarchy based on the four largest $H_{ij}$ values. (Note: The minimum value of $H_{ij}$ between domains with specified conditional relations is equal to .73.)

which some specified minimum proportion of subjects had obtained 
were selected and used to generate a domain hierarchy. The 
hierarchies generated had the minimum possible number of specified 
conditional relations which were still “compatible” with the response 
patterns being considered. For a hierarchy to be compatible with a set 
of response patterns it is necessary that the conditional relations 
specified by the hierarchy allow all of the selected patterns of response 
to occur without the necessity of using an error component. By using 
this procedure and varying the minimum proportion of subjects neces- 
sary for the selection of a response pattern, three different hierarchies 
were generated. It should be noted that all possible criteria of minimal- 
ly acceptable proportions of subjects for inclusion of response patterns 
were considered. However, criteria of minimum proportion of Ss ob- 
taining a given response pattern falling below .02 were not used. This 
was because under such criteria the acceptable response patterns did 


Figure 3. Generated hierarchy based on the eight largest $H_{ij}$ values. (Note: The minimum value of $H_{ij}$ between domains with specified conditional relations is equal to .63.)
Figure 4. Generated hierarchy based on the nine largest $H_{ij}$ values. (Note: The minimum value of $H_{ij}$ between domains with specified conditional relations is equal to .58.)

not provide for any acceptable conditional relations between domains. 
Criteria of minimum proportion of greater than .04 were also rejected 
since the number of response patterns falling above .04 was very small. 
The hierarchies generated are schematically represented in Figures 8 
through 10. 
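
The two ingredients of the response-pattern procedure, selecting patterns obtained by at least a minimum proportion of subjects and checking whether a hierarchy is compatible with them, can be sketched as follows. The function names and data are hypothetical, and the compatibility rule coded here is one reading of the description above: a relation is violated by any selected pattern that passes the higher domain while failing its prerequisite. The sketch does not reproduce the Bart and Krus algorithm for constructing a hierarchy.

```python
from collections import Counter


def frequent_patterns(responses, min_proportion):
    """Response patterns obtained by at least min_proportion of subjects."""
    counts = Counter(tuple(pattern) for pattern in responses)
    n_subjects = len(responses)
    return {pattern for pattern, count in counts.items()
            if count / n_subjects >= min_proportion}


def is_compatible(relations, patterns, domains):
    """True if no selected pattern contradicts any conditional relation.

    relations holds (prerequisite, dependent) domain pairs; each pattern
    lists 0/1 item scores in the order given by domains.
    """
    position = {domain: index for index, domain in enumerate(domains)}
    for pattern in patterns:
        for prerequisite, dependent in relations:
            passed_dependent = pattern[position[dependent]] == 1
            failed_prerequisite = pattern[position[prerequisite]] == 0
            if passed_dependent and failed_prerequisite:
                return False
    return True


# Hypothetical data for two domains and ten subjects.
domains = [10, 12]
responses = [(1, 1)] * 5 + [(1, 0)] * 4 + [(0, 1)] * 1
strict = frequent_patterns(responses, 0.2)
lenient = frequent_patterns(responses, 0.1)
```

Raising the minimum proportion discards rare patterns, which in this example is what allows the relation "domain 10 is prerequisite to domain 12" to remain compatible with the selected patterns.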


Results 


Relations among Items within Domains 


Figure 5. Generated hierarchy based on the ten largest transitive $H_{ij}$ values. (Note: The minimum value of $H_{ij}$ between domains with specified conditional relations is equal to .54.)

Figure 6. Generated hierarchy based on the eleven largest $H_{ij}$ values. (Note: The minimum value of $H_{ij}$ between domains with specified conditional relations is equal to .54.)

The mean item scores, p, found within each domain showed
considerable spread in magnitude. The specific values of p for each
domain were: p(10) = .71, p(12) = .70, p(13) = .61, p(15) = .43, p(17) =
.53, and p(18) = .41. However, the standard deviations of item
difficulties found within the various domains were relatively small (ranging
from .02 to .09) when compared to the same statistic across
domains, which was .13. At the same time, coefficients of internal
consistency (i.e., KR-20) among items within the various domains
were found to be high. These coefficients ranged from .84 to .93.
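
For readers unfamiliar with the coefficient, a minimal sketch of a KR-20 computation for dichotomously scored items is given below. It uses the standard Kuder-Richardson formula 20 with the population form of the total-score variance; the function name and the small data set are illustrative and unrelated to the study's data.

```python
def kr20(scores):
    """Kuder-Richardson formula 20 for a students-by-items 0/1 matrix."""
    n_students = len(scores)
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("KR-20 needs at least two items")
    totals = [sum(row) for row in scores]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / n_students
    if var_total == 0:
        raise ValueError("total scores have no variance")
    sum_pq = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in scores) / n_students   # item difficulty
        sum_pq += p * (1 - p)
    return (n_items / (n_items - 1)) * (1 - sum_pq / var_total)


# Hypothetical responses of four students to three items.
example = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
reliability = kr20(example)
```

In the study the coefficient would be computed separately for the ten items of each domain, with one row per student.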


Analysis of the Domain Hierarchy Models 


Two general kinds of comparisons of "goodness of fit" provided by
the various generated hierarchical models to the actual data were
considered. First, comparisons were made among hierarchical models
within each hierarchical generation procedure with varying numbers
of conditional relations. Second, comparisons were made among the
models with similar numbers of specified conditional relations
produced by the various generation procedures.

The kind of evidence which was used as a means of comparing the 


Figure 7. Generated hierarchy based on all of the $H_{ij}$ values. (Note: The minimum value of $H_{ij}$ between domains is equal to .45.)
Figure 8. Generated hierarchy based on the seventeen most frequently occurring response patterns. (Note: The minimum proportion of Ss obtaining a given response pattern for inclusion of that pattern was set at .02.)


various generated models was mean item indexes of homogeneity, $\bar{H}_{ij}$.
These values were based on all pairs of items which were from different
domains having hypothesized conditional relations between them (the
$H_{ij}$ values which were used in this section were slightly modified in
that the hypothesized conditional relations were used in determining
the direction of the conditional probability within $H_{ij}$, rather than the
relative difficulties of the items).

The results in Table 2 show that the $\bar{H}_{ij}$ values based on the various
generated models (see columns designated I) provide quite
substantial coefficients when compared with the mean of all possible $H_{ij}$
values, .415 (these values were based on both of the possible
conditional relations between all pairs of items from different domains),
or when compared to the mean of $H_{ij}$ values based on conditional
relations not suggested by the models (see columns designated III). It
may further be noted that, within each of the hierarchical generation

Figure 9. Generated hierarchy based on the twelve most frequently occurring response patterns. (Note: The minimum proportion of Ss obtaining a given response pattern for inclusion of that pattern was set at .03.)
Figure 10. Generated hierarchy based on the seven most frequently occurring response patterns. (Note: The minimum proportion of Ss obtaining a given response pattern for inclusion of that pattern was set at .04.)


procedures, as the number of conditional relations between domains
was decreased there was an increase in the $\bar{H}_{ij}$ values suggested by the
corresponding model. However, one may also note that there was a
simultaneous increase in the $\bar{H}_{ij}$ values based on all other conditional
relations not suggested by the model.

It next became of interest to determine the number of conditional 
relations which could effectively be specified within a given generation 
procedure. This was accomplished by assessing the “strength” of con- 
ditional relations found within a given model but not found in 
similarly generated models with fewer specified relations (i.e. the 
strength of conditional relations which were added to previously 
specified relations was assessed). To determine the strength of the
added conditional relations, $\bar{H}_{ij}$ values based solely on the added
relations were used. These values are also presented in Table 2
(see columns designated II). 

When this approach was used for the “$H_{ij}$” generated models, it was
found that the added conditional relations provided relatively large
$\bar{H}_{ij}$ values for those models with eight or fewer specified relations,
while those models with more than eight specified relations provided
corresponding values which were markedly smaller. Similar results
were found in the case of the “response frequency” generated models.
However, in this case, the model with 10 specified relations showed a
much less dramatic drop than that found in the corresponding “$H_{ij}$”
generated model (the $\bar{H}_{ij}$ values for the two added conditional
relations were .44 and .53, respectively, for the “$H_{ij}$” and the “response
frequency” generated models with 10 specified relations).

It was of further interest to note the differences in magnitude found
between $\bar{H}_{ij}$ values based on added conditional relations within the
models and the corresponding $\bar{H}_{ij}$ values based on conditional
relations outside of the models. The differences between these values are
presented in Table 2 (see columns designated IV). Comparisons of
these difference values showed that the models with eight or fewer
specified conditional relations provided much larger differences than
the other models. This was true for both of the empirical generation
procedures.

Comparisons were also made among models generated by different 
generation procedures with equivalent numbers of specified relations. 
This was done by comparing the $\bar{H}_{ij}$ values based on all of the
specified relations within the models. Differences were only found in
the case of those models with 10 specified relations. Here the
“response frequency” generated model provided the largest value,
followed by the “$H_{ij}$” generated model. The “theoretical” generated
model was a low third, providing the lowest $\bar{H}_{ij}$ value of any of the
models generated, including those models with larger numbers of 
specified relations. 


Discussion 


The results of this investigation suggested that the specified
conditional relations, for all of the models considered, provided relatively
good “fit” to the data (this lends support to the contention that
students tend to learn how to correctly work items within specified
domains prior to items in other domains). This assessment was inferred
from the fact that the proportions of possible decreases in item
difficulties based on the conditional information specified by the models
were considerably larger than similar proportions based on
non-specified conditional information (i.e., the effect of “success” on items
from a given domain more greatly affected the probability of
“success” on items from domains specified as being conditional on
“success” in the first domain).
This finding of relatively “good fit” across all hierarchies considered
is not surprising since all of the generated models specified sets of
conditional relations which were quite similar to one another. These
similarities among specified conditional relations were found both
with respect to the specific domains involved and with respect to the
direction of the conditional relations. One of the most noticeable
characteristics, which was found to play a prominent role within all of
the models generated, dealt with domains 10 and 12. This was because
all of the generated models tended to list frequently these two domains
as prerequisites for the acquisition of the other domains. Another
characteristic which was found throughout all of the models was that
the specified relations placed domains with easier items as
prerequisites for “success” in domains with more difficult items (there were
only three specified relations in all of the models considered for which
this was not true). This observation tends to suggest that level of item
difficulty found within domains plays an important role in determining
whether one domain is seen as a prerequisite for another domain.
In order to determine how many conditional relations could “effec- 
tively” be specified, comparisons were made among the models with 
varying numbers of specified conditional relations. This was done 
separately for both of the empirical generation procedures. Here it was 
found that, as the number of specified conditional relations was
increased, there was a simultaneous decrease in the mean $H_{ij}$ values
provided by the models. This suggests that both of these generation 
procedures tend to specify conditional relations which show less 
strength as the number of specified relations are increased. This is seen 
as being a major asset for these procedures, since it would allow an in- 
vestigator to increase the average strength of specified relations within 
a model simply by decreasing the number of relations specified. 
This phenomenon of decreasing $H_{ij}$ values which accompanies the
increase of specified relations presented a problem for identifying an 
“optimum” number of specified relations. This is because it was 
desirable to obtain two antithetical characteristics in a model. First, it 
was desirable to specify an “adequate” number of conditional rela- 
tions to clearly describe any existing relations among domains, while 
at the same time specifying only those conditional relations which 
were “strong.” Thus, depending on how relatively important each of 
these characteristics is to an investigator, models with differing 
numbers of conditional relations may be seen as “most desirable.”
Further comparisons made among the models with respect to their
“added” conditional relations (i.e., those conditional relations found
within a given model but not found in similarly generated models with
fewer specified relations) tended to suggest that the models with eight
out of the thirty possible specified relations provided the most
desirable number of conditional relations. This was true for both of
the empirical generation procedures. The rationale behind this
interpretation was that these particular models allowed for the
specification of a maximum number of conditional relations without the
occurrence of large decreases in the “added” conditional relations.

It is interesting to note that the “most desirable” models which were
generated by the two different empirical generation procedures are in
fact identical models. The most prominent characteristic of these
models is their specification of domains 10 and 12 as prerequisites for
all of the other domains. This may tend to suggest that the specific
skills dealt with in these two domains play an important role in
preparing students for the acquisition of more difficult content in the
general area of multiplication.

Further comparisons were made among models with equivalent 
numbers of specified relations that were hypothesized on the basis of 
the different generation procedures. This was done to determine which 
of the three generation procedures produced the “best fitting” models 
and thus might be more desirable for use in future research. 

The results of these comparisons seemed to suggest that the “$H_{ij}$”
and the "response frequency" generation procedures generated 
models based on pilot data which provided strikingly similar “fits” to 
the data in the main study. This finding is quite reasonable if one con- 
siders the close similarity of these models. In the case of the models 
with four and ten specified conditional relations, only two of the 
specified relations in each case differed from one generation procedure 
to the other, while in the case of eight specified conditional relations, 
the models were found to be identical. In general, the comparisons 
between the two empirical generation procedures tend to suggest that 
they both provide an effective means of generating hierarchical 
models. In order of “effectiveness,” one might choose the “response 
frequency” procedure since it provided slightly better “fit” in the case 
of the models with 10 specified relations. However, in making a choice, 
it should be noted that the “response frequency” procedure did not al- 
low for the generation of as many different models. It also did not al- 
low for the specification of the exact number of conditional relations 
desired within a model. 

Comparisons made between the “theoretical” hierarchical model 
for domains based on “experts’” judgments and the corresponding
empirical models based on pilot data, suggested that the latter 
provided a somewhat better fit to the data. This seems to imply that 
"theoretical" generation may be a less effective means of accurately 
describing relations among domains. This finding also tends to negate 
the validity of the structure suggested by the "theoretical" model and 
raises a question as to how its structure might be improved. 

As might be expected, the similarities found between the
“theoretical” model and the corresponding “empirical” models were
less striking than those found between the two empirical models. A
comparison of these models showed that of the ten relations specified
by the “theoretical” model, there were, respectively, four and five
relations which were inconsistent with those specified in the “$H_{ij}$” and
“response frequency” models. Those portions of the “theoretical”
model which were found to be inconsistent with the corresponding
empirical models provided a $\bar{H}_{ij}$ value of .455. However, those portions
which were in agreement provided a $\bar{H}_{ij}$ value of .598. These findings
tend to invalidate those parts of the “theoretical” model which were
inconsistent with the empirical models, while lending support to the
remainder of the model. It is interesting to note that the conditional
relations which are assessed as invalid deal mainly with the
importance of the skills in domain 13 as a prerequisite to the acquisition of
other multiplication skills (i.e., domain 13 as a prerequisite for
domains 12, 15, and 17 along with domain 10 as a prerequisite for
domain 12 were found in the “theoretical” model but not in the
corresponding empirical models). Information regarding invalidated
parts of the theoretical model is seen as providing valuable
information to educators. This is because it allows them to raise questions
about why these particular conditional relations did not hold for
students and what implications this may have.

One factor which placed limitations on the possible interpretation of 
the meaning of domain hierarchies was that teaching procedures were 
not actively manipulated. Thus, it was not possible to determine the 
extent to which different kinds of variables were actually affecting both 
the generation and fit of the obtained hierarchies. It is possible that 
either manner or order of presentation of content could affect the 
structure of the underlying domain hierarchy. On the other hand, the 
structure of the underlying hierarchy may be dependent on the manner
in which skills related to various domains form necessary
prerequisites for the mastery of items from other domains.
SOME MULTIPLE RANGE TESTS FOR VARIANCES


KENNETH J. LEVY 
State University of New York at Buffalo 


Often times, the experimenter is interested in making inferences 
about treatment variances instead of, or in addition to, inferences 
about means. Three multiple range tests are proposed for the pur- 
pose of specifying which treatment variances or sets of variances are 
homogeneous. The procedures are based upon the Fmax statistic, 
Cochran’s statistic, and a normalizing log transformation of the
sample variances. All three tests depend heavily upon the underlying
sumption of normality. 


ALTHOUGH interest in means is frequently much greater than
interest in variances both theoretically and empirically,
instances do arise in which the experimenter is specifically concerned
with the problem of making inferences about variances. Various
procedures are available for testing the hypothesis of homogeneity of
variance associated with k independent normal populations
$N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, k$, with unknown variances $\sigma_i^2$.
While the rejection of the hypothesis of equal treatment variances
may be statistically interesting, it is in general not very useful. To
know simply that a set of treatment variances differ is of limited use,
because one still does not know which treatment variances differ from
one another. In 1956, H. A. David proposed a multiple range test for
variances utilizing the Fmax statistic ($s_{max}^2 / s_{min}^2$) and Duncan's (1955)
philosophy with respect to the choice of significance levels at the
various stages of the test. The present paper proposes three different
multiple range tests based upon the Newman-Keuls (1939; 1952)
philosophy with respect to these significance levels. The three tests
will utilize the Fmax statistic, Cochran’s statistic ($s_{max}^2 / \sum_{i=1}^{k} s_i^2$), and
a normalizing log transformation of the sample variances, respectively.

With respect to a test for variances, the Newman-Keuls and Duncan
procedures may be clearly differentiated by defining a p-variance
significance level. For two variances $\sigma_1^2, \sigma_2^2$, let $D(\sigma_1^2 \neq \sigma_2^2)$ denote the
decision $\sigma_1^2 \neq \sigma_2^2$. For three variances $\sigma_1^2, \sigma_2^2, \sigma_3^2$, let
$D(\sigma_1^2 \neq \sigma_2^2 \cup \sigma_2^2 \neq \sigma_3^2 \cup \sigma_1^2 \neq \sigma_3^2)$ denote the decision that at least one of the
variances differs from the other two and they may all be different.
For a group of k variances the two-variance and three-variance
significance levels for the sets $\sigma_1^2, \sigma_2^2$ and $\sigma_1^2, \sigma_2^2, \sigma_3^2$ are, respectively,

$$\alpha(\sigma_1^2, \sigma_2^2) = \sup_{\sigma_3^2, \ldots, \sigma_k^2} P\left(D(\sigma_1^2 \neq \sigma_2^2) \mid \sigma_1^2 = \sigma_2^2;\ \sigma_3^2, \ldots, \sigma_k^2 \text{ arbitrary}\right)$$

$$\alpha(\sigma_1^2, \sigma_2^2, \sigma_3^2) = \sup_{\sigma_4^2, \ldots, \sigma_k^2} P\left(D(\sigma_1^2 \neq \sigma_2^2 \cup \sigma_2^2 \neq \sigma_3^2 \cup \sigma_1^2 \neq \sigma_3^2) \mid \sigma_1^2 = \sigma_2^2 = \sigma_3^2;\ \sigma_4^2, \ldots, \sigma_k^2 \text{ arbitrary}\right).$$

Generally, then, a p-variance significance level for a group of p
variances is the probability of falsely rejecting the hypothesis that the p
variances are all equal, this probability being maximized over the
remaining k − p variances.

For the Newman-Keuls procedure, the p-variance significance levels
for any p are set equal to $\alpha$. In contrast, the Duncan p-variance
significance levels for a given p are $1 - (1 - \alpha)^{p-1}$, $p = 2, 3, \ldots, k$. Miller
(1966) points out that for means, the Duncan levels do not rise as
rapidly as the nonsimultaneous separate test levels; however, they do
increase with a fair amount of speed and soon exceed 1/2. Further, he
asserts that this violates the spirit of what simultaneous inference is
all about, namely, to protect a multiparameter null hypothesis against
any false declarations due to the large number of declarations
required. For these reasons, this author prefers the Newman-Keuls
procedure.
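
The contrast between the two sets of levels can be made concrete with a short calculation. The following Python sketch is illustrative only (the function names are not from the paper); it tabulates the Duncan level $1 - (1 - \alpha)^{p-1}$ against the constant Newman-Keuls level and finds the smallest p at which the Duncan level passes 1/2.

```python
def duncan_level(alpha: float, p: int) -> float:
    """Duncan's p-variance significance level, 1 - (1 - alpha)^(p - 1)."""
    return 1.0 - (1.0 - alpha) ** (p - 1)


def newman_keuls_level(alpha: float, p: int) -> float:
    """Newman-Keuls keeps the level at alpha for every p."""
    return alpha


def first_p_exceeding(alpha: float, bound: float = 0.5) -> int:
    """Smallest number of variances p whose Duncan level exceeds `bound`."""
    p = 2
    while duncan_level(alpha, p) <= bound:
        p += 1
    return p


for p in range(2, 7):
    print(p, round(duncan_level(0.05, p), 4), newman_keuls_level(0.05, p))
print("Duncan level first exceeds 1/2 at p =", first_p_exceeding(0.05))
```

With $\alpha$ = .05 the Duncan level equals .05 only for p = 2 and grows with every additional variance in the subset, which is the behavior Miller objects to.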

A multiple range test for variances may be performed utilizing the
Fmax statistic, Cochran's statistic, or a procedure based upon a log
transformation of the sample variances. The Fmax and Cochran's
statistics are both discussed at some length in Winer (1971) with
respect to general tests for homogeneity of variance. Bartlett and
Kendall (1946) investigated a normalizing log transformation of the
sample variance $s^2$ when sampling from a $N(\mu, \sigma^2)$ population. They
showed that $\log_e s^2$ is approximately normally distributed as $N(\log_e \sigma^2, 2/n)$,
where n is the number of degrees of freedom for $s^2$. This
transformation could be used in the following manner as a general test for
homogeneity of variance.


Suppose that independent random samples each of size n + 1 are
drawn from k normal populations $N(\mu_i, \sigma_i^2)$, $i = 1, 2, \ldots, k$, with
unknown means and variances. When the null hypothesis is true, i.e.,
when $\sigma_1^2 = \sigma_2^2 = \cdots = \sigma_k^2 = \sigma^2$, then the $\log_e s_i^2$ will be independent
and identically distributed, approximately as $N(\log_e \sigma^2, 2/n)$. Pearson
and Hartley (1942) have obtained the probability integral for the
following statistic:

$$R_n = \frac{x_{max} - x_{min}}{\sigma}$$

where $x_{max} - x_{min}$ is the range of an independent random sample of
size n drawn from a $N(\mu, \sigma^2)$ population. Thus, when the null
hypothesis is true for k groups, then

$$\frac{\log_e s_{max}^2 - \log_e s_{min}^2}{\sqrt{2/n}}$$

will be approximately distributed as $R_k$. If the observed value of $R_k$
exceeds the critical value obtained from the Pearson and Hartley table,
one would reject the hypothesis of homogeneity of variance.

Three multiple range tests will now be performed upon a
hypothetical set of data following a procedure for testing means
outlined in Winer (1971).

Suppose that a completely randomized experiment with 4 treatments
and 10 subjects per treatment has been conducted. The
experimenter wishes to make inferences concerning which treatment
conditions are homogeneous with respect to variance.
1. Compute the sample variances of the 10 measures in each of the 4
experimental groups,

$$s_i^2 = \frac{\sum_j (x_{ij} - \bar{x}_i)^2}{n_i - 1}$$

where $n_i$ = the number of subjects in the ith group and $n_i - 1$ = the
degrees of freedom for $s_i^2$.

Suppose the following results were obtained:

Group    $s_i^2$
1          .75
2        12.00
3         3.00
4         2.00
2. Order the sample variances horizontally and vertically as follows:

              (1)    (4)    (3)    (2)
              .75   2.00   3.00  12.00
(1)  .75
(4) 2.00
(3) 3.00

(Note that .75 is the sample variance associated with group 1.)
3. Compute the Fmax statistic for 4 groups taking the ratio of the
extremes (2)/(1), i.e., Fmax = 12.00/.75 = 16.00. From tables in Winer
(1971), the critical value for a .05-level test for 4 groups each with 9
degrees of freedom for $s^2$ is 6.31. Since the observed ratio exceeds 6.31,
the hypothesis of homogeneity of variance is rejected. Therefore an
asterisk should be entered in cell (1, 2) in the above table.

The basic credo of this multiple range test may be stated as follows:
the difference between any two variances in a set of k variances is
significant provided the ratio of the extremes (max/min) of each and
every subset which contains the given variances is significant according
to an $\alpha_p$ level test where p is the number of variances in the subset
concerned. Thus, if the initial test ratio were not significant, no further
tests would be made.

4. Since the initial observed ratio was significant, proceed to the
diagonal ratios (3)/(1) and (2)/(4). The critical value for a .05-level test
for 3 groups each with 9 degrees of freedom for $s^2$ is 5.34. Fmax =
(3)/(1) = 3.00/.75 = 4.00; Fmax = (2)/(4) = 12.00/2.00 = 6.00. Thus,
do not reject the hypothesis that $\sigma_1^2 = \sigma_4^2 = \sigma_3^2$; however, do reject the
hypothesis that $\sigma_4^2 = \sigma_3^2 = \sigma_2^2$. Since the ratio (3)/(1) is not significant,
one is not allowed to test any other differences within the triangle
bounded in the upper right corner by the cell (1, 3). Since the ratio
(2)/(4) is significant, an asterisk is recorded in cell (4, 2) and one
proceeds to test the ratio (2)/(3).

5. Fmax = (2)/(3) = 12.00/3.00 = 4.00. The critical value for a .05-level
test for 2 groups each with 9 degrees of freedom for $s^2$ is 4.03.
Since the observed ratio does not exceed the critical value, do not
reject the hypothesis that $\sigma_3^2 = \sigma_2^2$.

In summary, then, one may conclude that the variances associated
with groups 1, 4, and 3 do not differ; groups 3 and 2 do not differ;
however, the variance of group 2 is significantly greater than the
variance of either group 1 or group 4. In terms of Duncan's notation,

    (1)  (4)  (3)  (2)
    -------------
              --------

where groups underlined by a common line do not differ from one
another; groups not underlined by a common line do differ.
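
Steps 2 through 5 can be sketched in code. The following Python function is a minimal illustration, not the author's program: it orders the variances, tests the widest stretch first, and skips any stretch lying inside one already declared homogeneous. The function name and data layout are hypothetical. The critical values are the .05-level Fmax values for 9 degrees of freedom: 6.31 and 5.34 as quoted in the text, and 4.03 for two groups, taken here from standard Fmax tables.

```python
def fmax_range_test(variances, critical):
    """Newman-Keuls-style multiple range test on variances using Fmax.

    variances: dict mapping group label -> sample variance
    critical:  dict mapping p (number of variances spanned) -> critical Fmax
    Returns the groups ordered by variance and the set of
    (smaller, larger) pairs declared significantly different.
    """
    order = sorted(variances, key=variances.get)  # smallest to largest
    k = len(order)
    significant = set()
    homogeneous = []  # index stretches (lo, hi) declared homogeneous
    for p in range(k, 1, -1):  # widest stretch first
        for lo in range(0, k - p + 1):
            hi = lo + p - 1
            # No test is allowed inside a stretch already found homogeneous.
            if any(a <= lo and hi <= b for a, b in homogeneous):
                continue
            ratio = variances[order[hi]] / variances[order[lo]]
            if ratio > critical[p]:
                significant.add((order[lo], order[hi]))
            else:
                homogeneous.append((lo, hi))
    return order, significant


sample_variances = {1: 0.75, 2: 12.00, 3: 3.00, 4: 2.00}
critical_values = {4: 6.31, 3: 5.34, 2: 4.03}  # 4.03 is an assumed table value
order, significant = fmax_range_test(sample_variances, critical_values)
print(order, sorted(significant))
```

For the hypothetical data the function orders the groups as 1, 4, 3, 2 and flags only the pairs (1, 2) and (4, 2), in agreement with the summary above.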

Similar tests may be performed utilizing Cochran’s statistic and the
normalizing log transformation discussed above. For a test based
upon Cochran’s statistic, the initial observed ratio would be

$$C = \frac{s_{max}^2}{\sum_i s_i^2} = \frac{12.00}{.75 + 2.00 + 3.00 + 12.00} = .6761$$

From tables in Winer (1971), the critical value for a .05-level test for 4
groups each with 9 degrees of freedom for $s^2$ is .5017. Since the
observed ratio exceeds .5017, reject the hypothesis of homogeneity of
variance and proceed to step 4.
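
The arithmetic for the initial Cochran ratio can be checked with a few lines of Python. This is only a sketch of the computation for the hypothetical data; the function name is illustrative, and the critical value .5017 is the one quoted from Winer (1971).

```python
def cochran_c(variances):
    """Cochran's statistic: largest sample variance over the sum of all."""
    return max(variances) / sum(variances)


c = cochran_c([0.75, 12.00, 3.00, 2.00])
reject = c > 0.5017  # .05-level critical value for 4 groups, 9 df
print(round(c, 4), reject)
```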

For a test based upon the normalizing log transformation, one must
compute $\log_e s_i^2$.

Group    $s_i^2$    $\log_e s_i^2$
1          .75      −.2877
2        12.00      2.4849
3         3.00      1.0986
4         2.00      0.6931

The initial test statistic for this case is:

$$R_k = \frac{\log_e s_{max}^2 - \log_e s_{min}^2}{\sqrt{2/n}} = \frac{(2.4849 + .2877)(3)}{\sqrt{2}} = 5.8816$$

From the Pearson and Hartley (1942) tables, the critical value for a
.05-level test based upon the range of the $\log_e s_i^2$ for 4 groups is 3.62.
Since the observed ratio exceeds 3.62, again reject the hypothesis of
homogeneity of variance and proceed to step 4.
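
The log-range statistic is equally easy to verify numerically. The sketch below (with a hypothetical function name) divides the range of the log variances by $\sqrt{2/n}$, with n = 9 degrees of freedom; it reproduces the value above up to the rounding of the tabled logarithms.

```python
import math


def log_range_statistic(variances, df):
    """Range of the natural-log sample variances divided by sqrt(2 / df)."""
    logs = [math.log(v) for v in variances]
    return (max(logs) - min(logs)) / math.sqrt(2.0 / df)


r = log_range_statistic([0.75, 12.00, 3.00, 2.00], df=9)
print(r > 3.62)  # compare with the critical value quoted from the tables
```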

It should be noted that, in general, tests for homogeneity of variance
are extremely sensitive to violations of the underlying assumption of
normality. Box (1953) points out that most tests do not utilize any
evidence of variance variability within the samples. The sample
variability is measured theoretically, and the theoretical variability
changes as the underlying distribution changes. For this reason, the
above tests were studied via Monte Carlo techniques with sampling
occurring from normal, uniform, and double exponential populations.
It was found that all three tests were seriously affected by
non-normality with respect to both their significance levels and their
empirical power. When sampling occurred from normal populations, the
Fmax test and the test based upon the normalizing log transformation
were found to be most sensitive to those cases in which a sample
variance was anomalously small. In contrast, the test based upon
Cochran's statistic was found to be most sensitive to those cases in
which one or more sample variances were anomalously large. In
conclusion then, one should be reasonably confident about the
assumption of normality before proceeding with any of the procedures
discussed above.
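
The sensitivity to non-normality can be illustrated with a small simulation. The Python sketch below is not a reproduction of the Monte Carlo study reported here; it is a simplified example that estimates only the rejection rate of the overall Fmax test (4 groups of 10, critical value 6.31) when all populations have equal variance. The seed, the number of replications, and the function names are arbitrary choices made for the illustration.

```python
import random


def sample_variance(xs):
    """Unbiased sample variance."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def draw(dist, rng, n):
    """Draw n observations from the named population."""
    if dist == "normal":
        return [rng.gauss(0.0, 1.0) for _ in range(n)]
    if dist == "uniform":
        return [rng.uniform(-1.0, 1.0) for _ in range(n)]
    if dist == "double_exponential":
        # The difference of two independent exponentials is double exponential.
        return [rng.expovariate(1.0) - rng.expovariate(1.0) for _ in range(n)]
    raise ValueError(dist)


def rejection_rate(dist, k=4, n=10, critical=6.31, reps=2000, seed=1):
    """Proportion of replications in which Fmax exceeds the critical value."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(reps):
        variances = [sample_variance(draw(dist, rng, n)) for _ in range(k)]
        if max(variances) / min(variances) > critical:
            rejections += 1
    return rejections / reps


rates = {d: rejection_rate(d)
         for d in ("normal", "uniform", "double_exponential")}
print(rates)
```

Under these assumptions the empirical level stays near the nominal .05 only for the normal population; it falls below .05 for the uniform population and rises well above it for the double exponential population.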


REFERENCES

Bartlett, M. S. and Kendall, D. G. The statistical analysis of variance
heterogeneity and the logarithmic transformation. Journal of the Royal
Statistical Society, Supplement, 1946, 8, 128-138.

Box, G. E. P. Non-normality and tests on variances. Biometrika, 1953,
40, 318-335.

David, H. A. The ranking of variances in normal populations. Journal
of the American Statistical Association, 1956, 51, 621-626.

Keuls, M. The use of the "studentized range" in connection with an
analysis of variance. Euphytica, 1952, 1, 112-122.

Miller, R. G. Simultaneous statistical inference. New York:
McGraw-Hill, 1966.

Newman, D. The distribution of the range in samples from a normal
population, expressed in terms of an independent estimate of
standard deviation. Biometrika, 1939, 31, 20-30.

Pearson, E. S. and Hartley, H. O. The probability integral of the range
in samples of n observations from a normal population.
Biometrika, 1942, 32, 301-310.

Winer, B. J. Statistical principles in experimental design. New York:
McGraw-Hill, 1971.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 
1975, 35, 605-611. 


JUDGMENTAL BIAS IN THE RATING OF 
ATTITUDE STATEMENTS' 


WILLIAM H. BRUVOLD* 
University of California, Berkeley 


Judges holding divergent attitudes toward high contact uses of 
water reclaimed from community sewage rated two sets of attitude 
statements regarding this issue. Results showed a close linear 
relationship between item scale values obtained from positive and 
negative attitudinal groups, and also a somewhat reduced range of 
ratings for judges holding unfavorable personal attitudes toward 
reuse. These findings, and the findings of previous research on this is- 
sue, were seen as being consonant with an item displacement theory 
of rating performance and supportive of equal interval measurement. 


Controversy has encompassed the equal-appearing intervals at- 
titude scaling procedure ever since Hovland and Sherif (1952) chal- 
lenged the assumption that judges’ personal attitudes do not influence 
item placement. Of the many attempts to deal with the issue, five 
relatively distinct explanations of the influence of attitude upon the 
judgmental process seem to be most prominent at this time.

First, the well known notions of Hovland and Sherif (1952) continue 
to maintain their importance primarily because these authors were the 
first to clearly stipulate that relationships between median item
placements would be specifiably curvilinear for different attitudinal groups.
The curvilinear hypothesis has important implications for a theory of
judgment and important implications for measurement theory. If
median item placements are not linearly related then attitude scales
constructed by the method of equal-appearing intervals cannot represent
interval level measurement (Torgerson, 1958).

Second, and in response to the Hovland and Sherif (1952) paper, 
Hinckley (1963) replicated his earlier work supposedly demonstrating 
no influence of attitude on item rating. The no influence position was 
reasserted on the basis of the replication study. The implication of this
position regarding interval level measurement is very clear. If judges'
personal attitudes do not influence item ratings, then median item
placements for any two groups of judges, regardless of their typical
attitude, should be identical. Such a result would produce precisely
linear relationships between item medians having slope one and
intercept zero. Such results would support, rather than refute,
attainment of interval level measurement (Torgerson, 1958).

Third, Upshaw (1965) has developed a variable series theory main- 
taining the linearity notion contained in the no influence of attitude 
position and also accommodating the idea that attitudes do have a
systematic effect on ratings. The outcome of Upshaw's (1965) effort is
that interval level measurement is supported in terms of Torgerson's
(1958) master scale evaluation technique, yet attitude is seen as an
important determinant of obtained judgments through its influence over
the judge’s perspective on the items rated. The linearity aspect of 
Upshaw's (1965) expectation was reasonably validated by his own 
research while the expectation regarding the influence of attitude on 
median item ratings was not confirmed. 

In fact Upshaw's (1965) findings on the relation of attitude to
median item placement were consistent with the findings of Zavalloni and
Cook (1965). The latter note that their results appear to confirm an
emerging generalization about the relation of judges' attitudes and
item placement on the equal-appearing interval continuum; namely,
that judges with favorable attitudes tend to employ a wider range of
ratings which results from more positive ratings for positive items, and
more negative ratings for negative items, than those given by judges
having unfavorable attitudes toward the matter at hand. Zavalloni and
Cook (1965) account for these results in terms of the judge's agreement
or disagreement with items in conjunction with a tendency to rate
agreed-with items more favorably and disagreed-with items less
favorably. While these authors posit a clear and definite effect of
attitude upon judgments, the implication of their explanation for
measurement theory is not clear. The displacement theory of Zavalloni
and Cook (1965) could remain viable with either linear or
curvilinear relations between median item placements; however, it
appears that displacement theory would expect near linear relationships
if certain neutral items show little displacement. If relationships are
linear, or nearly so, and if attitudes of the more favorable group were
represented on the x-axis, the slope should be less than one and the
y-intercept greater than zero.

The fifth prominent explanation of the influence of attitudes upon 
item judgment has recently been developed by Eiser (1971). This effort 
appears to be primarily an elaboration of the Zavalloni and Cook 
(1965) position. The major effect of attitude is held to be item displace- 
ment as described above and it is supposedly due to perceived item 
contrast based upon the judges' personal reaction to individual items. 
This effect does not operate alone, since, in addition, measurable
effects of social norms and anchoring are posited. The joint operation 
of these three factors is supposed to produce median items for different 
attitudinal groups that are curvilinearly related. Eiser (1971) does not 
specify the character of the nonlinearity expected nor its basis in the 
three factors posited. Nevertheless, the expectation of curvilinearity 
represents an extension of the Zavalloni and Cook (1965) explanation 
that is uncongenial to interval measurement. 

The character of the mathematical relationship between median 
item placements obtained from divergent attitudinal groups is a 
central issue for the five explanations here reviewed. Two of the five 
predict curvilinearity, two predict linearity, and one suggests a certain 
kind of linear relationship. Further, as mentioned above, the presence 
or absence of linearity is most important for reasons involving theory 
of measurement. In view of the centrality of this issue it is surprising to
note that none of the five articles referenced used standard statistical
procedures (McNemar, 1969) to describe and evaluate the relationship
of interest. Analyses have involved either coarse groupings and
analysis of variance without tests for trends, or simple correlation
coefficients without accompanying eta values. A major purpose of the
present study is to assess the relationship of interest using appropriate
statistical techniques that will more adequately evaluate the five
competing explanations here summarized and their implications for theory
of measurement.
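
The reason a correlation coefficient needs an accompanying eta value can be shown with a small numerical sketch. The Python example below is illustrative only: the data are invented, the function names are hypothetical, and each distinct x value is treated as a group, which assumes replicated observations at each x. When a relationship is linear, eta and the absolute value of r are close; when it is curvilinear, eta can be near one while r is near zero.

```python
import math
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def pearson_r(x, y):
    """Product-moment correlation between x and y."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_ratio(x, y):
    """Eta of y on x, treating each distinct x value as a group."""
    groups = defaultdict(list)
    for a, b in zip(x, y):
        groups[a].append(b)
    grand = mean(y)
    ss_total = sum((b - grand) ** 2 for b in y)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups.values())
    return math.sqrt(ss_between / ss_total)


x = [1, 1, 2, 2, 3, 3, 4, 4, 5, 5]
y_linear = [3.0, 3.2, 5.0, 4.8, 7.0, 7.2, 9.0, 8.8, 11.0, 11.2]
y_curved = [4.0, 4.2, 1.0, 0.8, 0.0, 0.2, 1.0, 1.2, 4.0, 3.8]

for label, y in (("linear", y_linear), ("curved", y_curved)):
    print(label, round(pearson_r(x, y), 3), round(correlation_ratio(x, y), 3))
```

A large gap between eta squared and r squared is the usual signal of departure from linearity, which a correlation coefficient reported alone cannot reveal.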
Method 


One hundred and seven individuals served voluntarily as judges in 
this research and all were residents of California. Sixty-four judges 
were male and ages ranged from 26 to 73 years. An attempt was made 
to obtain the widest possible spectrum of opinion on the matter of 
reuse of reclaimed water and therefore judges were recruited from 
University classes, a professional engineering group, a well known
conservation society and a local social club. No strict sampling plan
was followed; rather, the general aim of recruitment was to obtain a
diverse group of judges.

Two groups of statements, developed in an earlier and totally 
separate study (Bruvold, 1971), dealing with the use of water reclaimed 
from sewage for drinking, or for swimming, were employed in this
research. There were 83 statements regarding drinking and 58 involv- 
ing swimming. It should be strongly emphasized that statements 
regarding beliefs or behavioral intentions were deliberately excluded 
from the item pool. 

Items were typed, one to a plain white 3" X 5" card, and labeled at 
random with a one letter-three digit number combination which ap- 
peared in the upper right corner of each card. Order of items within 
sets, and order of the two sets themselves, were randomly arranged 
before each individual judging session. Standard equal-appearing in- 
terval judgment procedures were used as in the earlier work (Bruvold, 
1971). Each judge rated, for practice only, under the observation of the 
experimenter, five statements not included in the major item sets. 
Questions regarding judgment procedures were fully discussed before 
proceeding to the major item sets. Each judge worked alone to com- 
plete the major task, each gave a complete set of ratings, and the 
ratings of all were included in subsequent statistical analyses. 

Upon completion of the item rating task each judge's personal at- 
titude toward reclaimed water for drinking, and swimming, was assessed 
by two Thurstone scales. Each scale consisted of twenty 
statements selected from Remmers' (1934) stems dealing with attitude 
toward any practice. Reclaimed water for drinking, or for swimming, 
was inserted in place of the term “this practice" in each statement. No 
item was common to both scales. In completing each scale the judge 
was required to check the three or four items nearest to his own per- 
sonal attitude. Scores on both scales could range from a low of 1.0 to a 
high of 11.0. 


Analysis and Results 


Attitude scale scores were obtained for each judge and scale and 
then rank ordered separately by scale. Scores for the drinking scale 
ranged from 2.0 to 9.9 and those for swimming from 2.2 to 9.7. As 
in other work the distributions were divided into quintiles, here 
containing 21 judges each except for the third, which contained 23 judges. 
The remaining analyses focus upon the lowest and highest quintiles. For 
drinking the highest attitude score in the first quintile was 4.2 and 
the lowest score in the fifth quintile was 8.6. Analogous figures for 
swimming were 4.8 and 8.4. Four sets of equal-appearing interval 
item scale values were obtained by standard methods (Bruvold, 1971), 
one for each scale and attitudinal position. Subsequently, the relation- 
ship between median item scale values obtained from judges positive 
or negative in attitude toward reclaimed water for drinking was 
analyzed using statistical tests for linearity and curvilinearity proposed 
by McNemar (1969). The same procedure was also performed on the 
swimming data. These analyses were made possible by taking all 
median item values from the positive judges, the independent or 
x-variable, as category mid-points of 1.5, 2.5, ..., 10.5 while median 
item values from the negative judges, the dependent or y-variable, 
were left ungrouped. Thus, for example, any item scale value from 
the positive attitude group beginning with the integer 6 was recorded 
as 6.5 whereas associated y-variable item values from the negative 
attitude group were left unchanged. 

Results from the drinking data are now summarized. These 
analyses were performed for 83 items and the regression equation was 
y' = 0.944 x + 0.866. The F-ratio for the linear component of the 
regression was 2,199.98 (p < .001, 1/73 df) and it was 1.73 (p > .05, 
8/73 df) for the curvilinear component. The Pearson r between 
positive and negative item values was 0.981 and the corresponding eta 
equalled 0.984. A t test assessing the deviation of the obtained slope 
from 1.000 equalled 2.80 (p < .01, 81 df). Analogous figures for the 58 
swimming items are now presented. The regression equation was y' = 
0.889 x + 0.985. The F-ratio for linearity was 1,764.61 (p < .001, 1/48 
df) and for curvilinearity it was 0.178 (p > .05, 8/48 df). Pearson r 
equalled 0.985 and eta was 0.988. The t test of the difference of the ob- 
tained slope from unity was 5.55 (p < .001, 56 df). 
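
The F-ratios and t values reported above can be approximated from r, eta, and the slope alone. The sketch below assumes the textbook partition of the sum of squares into a linear part and a deviation-from-linearity part, with ten x-categories (mid-points 1.5 to 10.5), and the usual standard error of a regression slope. The function names are illustrative, not taken from the article, and because r, eta, and the slope are entered as rounded values the results only approximate the published figures.

```python
import math

def linearity_f_tests(r, eta, n_items, n_categories):
    """Return (F_linear, F_curvilinear) from Pearson r and the correlation ratio eta."""
    df_within = n_items - n_categories          # e.g. 83 - 10 = 73
    df_curve = n_categories - 2                 # e.g. 10 - 2 = 8
    ms_within = (1.0 - eta ** 2) / df_within    # variation about the category means
    f_linear = (r ** 2) / ms_within             # linear component, 1 df
    f_curve = ((eta ** 2 - r ** 2) / df_curve) / ms_within
    return f_linear, f_curve

def slope_vs_unity_t(slope, r, n_items):
    """t statistic (n_items - 2 df) for the hypothesis that the true slope is 1."""
    # SE(slope) = (s_y / s_x) * sqrt((1 - r^2) / (N - 2)), and s_y / s_x = slope / r
    se = (slope / r) * math.sqrt((1.0 - r ** 2) / (n_items - 2))
    return (slope - 1.0) / se

# Values for the 83-item analysis as printed in the text
f_lin, f_curve = linearity_f_tests(r=0.981, eta=0.984, n_items=83, n_categories=10)
t_slope = slope_vs_unity_t(slope=0.944, r=0.981, n_items=83)
print(round(f_lin, 1), round(f_curve, 2), round(t_slope, 2))
```

The negative sign of the t value simply records that the obtained slope lies below unity; the article reports its absolute value.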


Discussion 


The results here obtained were most uncongenial to positions 
(Hovland and Sherif, 1952; Eiser, 1971) predicting curvilinear 
relationships between median item ratings obtained from judges 
divergent in personal attitude toward the issue under study. Results of 
the regression analyses reported above showed that curvilinearity 
failed to reach customary criteria for statistical significance. This in- 
terpretation is strongly substantiated by comparison of r and eta 
values. Thus expectations predicting systematic non-linearity receive 
little support from these data and, further, the results do not indicate 
that interval level measurement is precluded for equal-appearing inter- 
val methods because of the effects of personal attitudes upon 
judgment-produced scales. 

Absence of significant curvilinearity between median item ratings 
for divergent attitudinal groups does not give Hinckley's (1963) no- 
effect position the support of these data. As noted above, Upshaw 
(1965) obtained an item displacement effect, more fully documented 
by Zavalloni and Cook (1965) and then confirmed by Eiser (1971) 
which indicates that judges holding unfavorable personal attitudes 
rate negative items more positively, and positive items more 
negatively, than judges holding favorable personal attitudes. The 
general implications of this expectation for linear slope and intercept 
values were outlined in the introduction section. It may be noted here 
that both regression equations reported above fit these general expec- 
tations. The analysis of the difference of obtained slopes from unity 
shows that these observed values were significantly less than one. 
Somewhat larger rating dispersions obtained for the positive at- 
titudinal groups are consonant with the slope results. The consistency 
of these findings yields additional support for the Zavalloni and Cook 
(1965) position. 

Summarizing, it may be stated that the present data provide support 
for linear item displacement theory emerging in the work of Upshaw 
(1965), and Zavalloni and Cook (1965). Item displacement theory need 
not invoke apparently complex explanatory constructs. Rather, the 
major effects of item displacement may be due to competing tenden- 
cies faced by an individual when learning to perform the judgments re- 
quired by the method of equal-appearing intervals. The writer and his 
students have noticed that many individuals experience difficulty when 
learning to give equal-appearing interval ratings. Most, if not all, do 
not seem initially to understand that their judgments of, and not 
responses to, items are sought. The difficulty may stem from the usual 
psychometric practice of asking for responses to items rather than 
judgments. If raters do not, or cannot, fully set aside a response 
tendency, then the rating of a particular item will be the result of two 
factors: how well the rater likes and agrees with the item, and also his 
judgment of where it belongs on the equal-appearing interval con- 
tinuum. Consonant with the notions of Zavalloni and Cook (1965), 
the joint operation of competing response and judgmental factors 
would result in smaller rating dispersions, higher ratings for negative 
items, and lower ratings for positive items, for raters unfavorable in 
personal attitude when compared to raters holding favorable personal 
attitudes toward the issue at hand. 

The position here expressed, while seeming to best account for past 
and present results, also suggests topics for future study. First, item 
displacement would likely be reduced as adequacy of rating instruc- 
tion increases. Since the present study used reasonably thorough in- 
structions involving several example statements, the item displacement 
effect, it would follow, should not have been large. Future research can 
determine if this reasoning has predictive value. Second, item displace- 
ment would likely be reduced as rating instructions are understood and 
accepted. Third, demand characteristics or perceived social pressure to 
respond to items rather than to judge them would likely enhance item 
displacement. Reading of the Hovland and Sherif (1952) article sug- 
gests that such effects may have been operative in that research. 
Conversely, perceived social pressure not to respond to items, but to 
try to judge them “objectively,” would likely reduce item displacement. 
Finally, analysis of the tendency to personally respond to individual 
items should prove interesting. Such analysis could lead to further ex- 
pectations regarding individuals and items most likely to evidence dis- 
placement. 


EMPIRICAL OPTION WEIGHTING WITH A 
CORRECTION FOR GUESSING¹ 


RICHARD R. REILLY 
Educational Testing Service 


Because previous reports have suggested that the lowered validity 
of tests scored with empirical option weights might be explained by a 
capitalization of the keying procedures on omitting tendencies, a 
procedure was devised to key options empirically with a “correction- 
for-guessing" constraint. Use of the new procedure with Graduate 
Record Examinations (GRE) data resulted in smaller increases in 
reliability than those observed when unconstrained procedures were 
used, but validities for quantitative subforms were not appreciably 
lowered. Validities for verbal subforms were lowered slightly, 
however. 


Two recent reports (Hendrickson, 1971; Reilly and Jackson, 1972) 
have suggested that weighting options empirically results in substan- 
tial increases in reliability and test homogeneity, but at the expense of 
lowered test validity. These findings are at variance with those 
reported in an earlier study by Davis and Fifer (1959) who found 
similar increases in reliability and slight increases in validity when 
options were weighted empirically. All three studies employed 
modifications of a weighting technique originally known as The Method of 
Reciprocal Averages (Mosier, 1946) which, in effect, maximizes the 
product-moment correlation between item scores and criterion scores 
by assigning to each item-option values proportional to the mean 
criterion score for all individuals choosing that option. 

A key difference between the Davis and Fifer study and the first two 
mentioned was that tests in the first two were administered with for- 
mula score instructions while Davis and Fifer instructed examinees to 
attempt every item. Thus, Hendrickson and Reilly and Jackson had an 
additional “option,” that of omit. Hendrickson, reporting on the 
weights generally assigned to the omit category, comments, “. . . An in- 
teresting finding of this study was that the weight of ‘omit’ was almost 
always lower than any of the other distracters in an item . . .” 
(Hendrickson, 1971). Reilly and Jackson (1972) take this a step further 
and suggest that “. . . the empirical keying procedures described 
capitalize on the tendency to omit and . . . while this tendency is 
reliable, it is not valid." 

¹ The research reported herein was supported by the Graduate Record Examinations 
Board. 

Because of these suggestions, it was decided to devise and test a 
procedure which weighted options subject to the constraint that the 
weight for omit equal the mean weight for the options. The rationale is 
similar to that used in the usual formula scoring method in that it as- 
sumes that an individual omitting an item should receive the expected 
weight under conditions of random response to that item. 

In order to determine the optimum weights for a single item, subject 
to the “correction-for-guessing” constraint, the following objective 
function was set up: 


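$$F = \sum_{j}\sum_{i} (y_{ij} - w_j)^2 - 2\lambda \left[ w_p - \frac{1}{k-1}\sum_{j} \delta_j w_j \right]$$

where 

$y_{ij}$ denotes the criterion score of the ith 
individual making the jth response; 
$w_j$ is the weight for the jth response; 
$k$ is the number of response categories, including omit; 
$w_p$ is the weight for the omit category; 
$\delta_j$ = one for $j \ne p$, and zero otherwise; and 
$\lambda$ is the Lagrange multiplier. 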


Taking partial derivatives and solving for the weights which 
minimize the function we find that the solution, which requires a small 
[(k - 1) × (k - 1)] matrix inversion, has the following properties:² (1) 
the mean item score over all individuals is equal to the mean criterion 
score; (2) the weights arrived at are proportional to the weights which 
will maximize the correlation between the item and the criterion sub- 
ject to the constraint of a fixed item variance (and, of course, the con- 
straint that the omit weight equals the mean of the option weights); 
(3) unlike the unconstrained option weights, the weights arrived at will 
not, in general, yield the maximum possible product-moment correla- 
tion; (4) for unconstrained weights it has been pointed out (Stanley 
and Wang, 1970) that a slope of 1.0 and a zero intercept will describe 
the regression of the criterion scores on the item scores. The appro- 
priate slope for the regression of criterion scores on item scores yielded 
by the new method will not, in general, be 1.0, nor will the appropriate 
intercept, in general, be zero. 

² The full proof is available from the author on request. 
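
The constrained minimization can be illustrated for a single item. The sketch below is not the author's matrix-inversion solution; it assumes the equivalent closed form obtained by noting that the sum of squared deviations reduces to a count-weighted sum over category means, and then applying one Lagrange multiplier for the single linear constraint. All names and the example counts are hypothetical, and every category is assumed to have been chosen by at least one examinee.

```python
def constrained_option_weights(option_means, option_counts, omit_mean, omit_count):
    """Least-squares weights with the omit weight tied to the mean option weight.

    Minimizes sum_j n_j * (w_j - m_j)^2 subject to w_omit = mean(option weights),
    where m_j is the mean criterion score of examinees choosing category j
    and n_j is the number of such examinees (assumed > 0).
    """
    k_opts = len(option_means)
    means = list(option_means) + [omit_mean]
    counts = list(option_counts) + [omit_count]
    # Constraint written as sum_j c_j * w_j = 0
    coeffs = [1.0 / k_opts] * k_opts + [-1.0]
    violation = sum(c * m for c, m in zip(coeffs, means))
    scale = sum(c * c / n for c, n in zip(coeffs, counts))
    shift = violation / scale
    weights = [m - shift * c / n for m, c, n in zip(means, coeffs, counts)]
    return weights[:-1], weights[-1]

# Hypothetical item: four options plus omit
opt_means = [62.0, 48.0, 45.0, 50.0]
opt_counts = [900, 300, 250, 350]
opt_w, omit_w = constrained_option_weights(opt_means, opt_counts,
                                           omit_mean=40.0, omit_count=700)
print([round(w, 3) for w in opt_w], round(omit_w, 3))
```

Because the constraint coefficients sum to zero, the count-weighted mean of the resulting weights equals the overall mean criterion score, which is property (1) above.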


Procedure 


Two parallel forms each of the verbal (denoted as V1 and V2) and 
quantitative (Q1 and Q2) sections of the Graduate Record Examina- 
tions (GRE) were devised by assigning one-half of the items on each 
section to each of the two special parallel forms. Forms V1 and V2 con- 
sisted of 50 items each, while forms Q1 and Q2 consisted of 27 items 
each. It should be noted that the two forms in each set, since they were 
constructed from operational tests, were not administered under 
separate time limits. Because of practical limitations the more 
desirable procedure of administering the two parallel forms under 
separately timed conditions was not possible. 

Data were the same as those used in the Reilly and Jackson (1972) 
study. A spaced sample (i.e., a sample consisting of every nth answer 
sheet) of 5,000 answer sheets (sample A) from the December 1970 ad- 
ministration of the GRE was employed for study purposes. A second 
sample (sample B) consisting of the answer sheets of 4,916 individuals 
from the same administration was taken for validation purposes. Sam- 
ple A was divided into two randomized block groups of 2,500 (samples 
A1 and A2) by blocking on total GRE score. The 5,000 answer sheets 
were ordered in terms of the verbal score plus the quantitative score 
and then randomly assigned to the two subsamples. This increased the 
likelihood that the two split samples would be comparable in terms of 
total score distributions. Each subtest was keyed against the scores on 
its parallel form in sample A1. The tests in sample A2 were then scored 
using these derived weights, and intercorrelations and alpha coeffi- 
cients were computed. Thus, all results reported are those obtained 
with cross-validated weights. 


TABLE 1 
Cross-Validated Parallel Forms Reliabilities for 
Empirically Keyed and Formula Scored Subtests 

                Formula    Empirically Keyed    K*
Verbal           .8909          .9242          1.49
Quantitative     .8742          .8892          1.16

* K gives the estimated proportional increase in test length which would be necessary to yield the increased R's 
shown. Rearranging the Spearman-Brown prophecy formula, 

$$K = \frac{R_w (1 - R_f)}{R_f (1 - R_w)}$$

where $R_f$ is the R obtained with formula score weights and $R_w$ is the cross-validated R obtained with empirical 
weights. 


TABLE 2 
Cross-Validated Internal Consistency Coefficients for 
Formula Scored and Empirically Keyed Tests 

        Formula    Empirically Keyed    K*
V1       .8745          .9069          1.40
V2       .8755          .9084          1.41
Q1       .8515          .8817          1.30
Q2       .8725          .8852          1.13

* K gives the estimated proportional increase in test length which would be necessary to yield the increased α's 
shown. Rearranging the Spearman-Brown prophecy formula, 

$$K = \frac{\alpha_w (1 - \alpha_f)}{\alpha_f (1 - \alpha_w)}$$

where $\alpha_f$ is the α obtained with formula score weights and $\alpha_w$ is the cross-validated α obtained with empirical weights. 
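
The K column of Tables 1 and 2 can be checked directly from the rearranged Spearman-Brown formula given in the table notes. A minimal sketch, with an illustrative function name and the Table 1 reliabilities read as .8909 to .9242 (verbal) and .8742 to .8892 (quantitative); small differences from the printed K values are to be expected because the inputs are rounded to four places.

```python
def length_factor(r_formula, r_weighted):
    """Spearman-Brown factor K: how much longer the formula-scored test must be
    to reach the reliability obtained with empirical weights."""
    return (r_weighted * (1.0 - r_formula)) / (r_formula * (1.0 - r_weighted))

k_verbal = length_factor(0.8909, 0.9242)
k_quant = length_factor(0.8742, 0.8892)
print(round(k_verbal, 3), round(k_quant, 3))
```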

The next step involved scoring the sample B answer sheets and com- 
puting the single order and multiple correlations between the em- 
pirically keyed tests and undergraduate GPA. Sample B was drawn 
from a total of 40 different colleges. Within-school samples ranged 
from a low of 16 to a high of 399. A modification of one of Tucker's 
central prediction methods was used to pool data across colleges.³ 


Results and Discussion 


The results of the keying on parallel forms reliability and internal 
consistency are presented in Tables 1 and 2. The proportional in- 
creases in effective test lengths are comparable to those reported by 
Hendrickson (1971) but less than those observed by Reilly and 
Jackson (1972). The smaller increments observed for the quantitative 
tests are consistent with previous findings, and may, as Hendrickson 
(1971) suggests, be related to the common observation that differences 
in the quality of the distracters are less apparent for general 
mathematical items than for verbal items. 

Reilly and Jackson (1972) observed increases in the correlations 
between verbal and quantitative tests when empirical weights were 
used and attributed these increases to the capitalization of the keying 
procedures on an omitting factor common to both tests. Thus, the 
results shown in Table 3 are of interest since they indicate that when 
constrained weights are used the large increases in verbal-quantitative 
correlations do not occur. When increases in reliability are taken into 
account the increases are actually slightly less than expected in two of 
the four cases shown and slightly greater than expected in the remain- 
ing two cases. 

³ The method used is a least squares [...] worked out by Robert F. Boldt and is 
more fully described in [...]. 


TABLE 3 
Intercorrelations between Verbal and Quantitative Forms 
for Formula Scored and Empirically Keyed Tests 

          Formula    Empirically Keyed    Expected*
V1Q1       .4154          .4577            .4269
V1Q2       .4190          .4428            .4550
V2Q1       .4079          .4304            .4191
V2Q2       .4061          .4138            .4173

* The expected values represent the expected correlation which should have resulted from the increased 
reliability of the empirical key scores. These values were obtained by multiplying the true formula score corre- 
lations between V and Q by the geometric mean of the empirical key score reliabilities. Parallel forms reliabilities 
were used in all cases. 
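
The "expected" column of Table 3 can be approximated from the note to that table. The sketch below is one reading of the note, assuming that the "true" formula score correlation is the observed correlation corrected for attenuation with the formula-score parallel forms reliabilities, and that it is then re-attenuated with the empirically keyed reliabilities. The function name is illustrative and the reliabilities are those of Table 1.

```python
import math

def expected_correlation(r_formula, rel_formula_v, rel_formula_q,
                         rel_keyed_v, rel_keyed_q):
    """Correlation expected solely from the gain in reliability."""
    # Disattenuate with the formula-score reliabilities ...
    true_r = r_formula / math.sqrt(rel_formula_v * rel_formula_q)
    # ... then re-attenuate with the geometric mean of the keyed reliabilities.
    return true_r * math.sqrt(rel_keyed_v * rel_keyed_q)

# First row of Table 3: observed formula-score correlation .4154
expected_first_row = expected_correlation(0.4154, 0.8909, 0.8742, 0.9242, 0.8892)
print(round(expected_first_row, 4))
```

Under this reading the first row comes out within rounding distance of the tabled .4269; rows that differ more may reflect form-specific reliabilities or transcription error in the table as it has come down here.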

In Table 4, the correlations are shown between pairs of parallel sub- 
tests, one scored with empirical weights and the other with formula 
weights. These latter correlations are, in general, slightly higher than 
the parallel forms reliability, in contrast to the uniformly lower values 
obtained when unconstrained weights were used (Reilly and Jackson, 
1972). 

The validity results are presented in Table 5. From a conceptual 
point of view the most desirable criterion for assessing validity would 
have been some measure of graduate school performance. The small 
within-school sample sizes as well as the generally restricted variance 
in graduate grades made this unfeasible. Undergraduate grades are a 
readily available concurrent measure of academic achievement and 
seemed a reasonable criterion against which to validate the different 
scoring methods. While the zero-order validities for the quantitative 
forms are almost unchanged, the multiple correlations are slightly 
lower overall owing primarily to the decreases in the correlations 
between GPA and the empirically keyed verbal subtests. It is difficult 
to explain why, even with the modified keying procedure, the verbal 
test validities were lowered. Apparently, the empirically keyed verbal 
tests are measuring some additional factors which, though reliable, 
may not be valid. 


TABLE 4 
Intercorrelations between Empirically Keyed and Formula Scored 
Parallel Forms 

                Parallel Forms    Empirically Keyed vs. Formula
                 Reliability        Scored Parallel Form*
                                      1          2
Verbal             .8909            .8953      .8914
Quantitative       .8742            .8726      .8848

* Column 1 shows the correlation between form V1 (Q1) empirically keyed and form V2 (Q2) formula scored. 
Column 2 shows the correlation between V2 (Q2) empirically keyed and V1 (Q1) formula scored. 


TABLE 5 
Pooled and Median Validity Coefficients 

                                Unconstrained        Constrained
                Formula Scores  Weighted Scores(a)   Weighted Scores
V1 Median           .3069           .2467               .2768
V1 Pooled(b)        .2703           .3167               .2998
Q1 Median           .1407           .1299               .1386
Q1 Pooled           .1664           .1909               .1894
V1 Q1 Median        .2768           .3443               .3145
V1 Q1 Pooled        .2666           .3184               .2997
V2 Median           .2987           .2358               .2841
V2 Pooled           .2939           .2532               .2828
Q2 Median           .1679           .1504               .1681
Q2 Pooled           .2055           .1847               .2054
V2 Q2 Median        .3135           .2589               .3036
V2 Q2 Pooled        .3013           .2637               .2919

(a) The unconstrained weights were those obtained by keying against parallel forms (Reilly and Jackson, 1972). 
(b) Pooled single order coefficients were estimated as follows: [equation illegible in source]; 
multiple correlation coefficients were obtained using a pooling procedure described by Briggs (1970). 


Conclusions 


While the results reported here certainly do not indicate that steps 
should be taken to implement empirical option weighting, the findings 
are not entirely discouraging either. It has been shown that a test can 
be made more reliable and more homogeneous through option weight- 
ing and, at least for the quantitative forms, without any appreciable 
lowering of validity. 

Further research should be done on several key issues which have 
been raised by this study. First, the issue of omitting behavior should be 
examined more closely. Breen (1972) has presented data for the SAT 
which indicate that “omit” scores are even more reliable than rights- 
only or formula scores. It may be that an omitting score can be used as 
a suppressor variable along with the formula score to increase the cor- 
relation with the criterion. 

Another interesting and potentially useful study would be one which 
examined the effects of keying options directly on the GPA criterion. 
Examination of the weights for options may reveal consistent patterns 
which could be helpful in guiding item writers. 


THE CONCEPT OF EFFICIENCY IN ITEM ANALYSIS 


RICHARD J. HOFMANN 
Miami University* 


In this paper a new item analysis index, e, is derived as a function 
of difficulty and discrimination to represent item efficiency. It is 
demonstrated algebraically that the maximum discriminating power 
of an item may be determined from its difficulty and then item effi- 
ciency is defined as the ratio of observed discrimination to maximum 
discrimination. The magnitude of the e-index will range from zero to 
unity and will provide additional information for item analyses. 


In a typical analysis of a test item, two indices are usually computed, 
a difficulty index and a discrimination index. If one assumes an 
analysis based upon the performance of two groups on the item, 
typically referred to as a U-L analysis, then a two by two contingency 
table may be used in the tabulation of the indices. Such an approach 
would typify the approaches suggested by Kelley (1939), Johnson 
(1951) and Cureton (1959) and is discussed in almost any basic 
measurement text devoting some space to item analyses. 

Assume that N' individuals have responded to some item in either a 
positive fashion, r, or a negative fashion, w. Furthermore, assume that 
either on the basis of their total scores on the instrument associated 
with the item or on the basis of some outside criterion two equal 
groups, $g_1$ and $g_2$, are determined from N'. In this case $g_1$ and $g_2$ in total 
represent N individuals where N may be equal to or proportionate to 
N'. Then using the subscripts 1 and 2 to denote those symbols as- 
sociated with $g_1$ and $g_2$, respectively, the N responses to the item are 
presented in Table 1. 


TABLE 1 
Contingency Table Summarization of Two Groups' Responses to a Single Item 

Response              Group
Category        g1          g2          Total
Positive        r1          r2          r1 + r2
Negative        w1          w2          w1 + w2
Total           r1 + w1     r2 + w2     N


The difficulty index a of the item may be denoted as: 

$$a = \frac{r_1 + r_2}{N} \qquad (1)$$

and it may range from zero to unity. One interpretation of a is that it 
represents the proportion of N individuals responding positively to the 
item. A second interpretation is that a represents the probability of 
observing a positive response to the item, P(r). 

The discrimination index of the item, b, may be denoted as: 

$$b = \frac{r_1 - r_2}{n} \qquad (2)$$

where 

$$r_1 + w_1 = r_2 + w_2 = n = N/2$$

and it may range from positive to negative unity. One interpretation of 
b is that it represents the difference between two conditional 
probabilities, the probability of a correct response given membership 
in group one, less the probability of a correct response given 
membership in group two. 
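
The two indices can be computed directly from the four cell counts of Table 1. The short sketch below uses hypothetical counts and illustrative function names; it also shows the conditional-probability reading of b.

```python
def difficulty(r1, r2, w1, w2):
    """Equation 1: proportion of all N responses that are positive."""
    n_total = r1 + r2 + w1 + w2
    return (r1 + r2) / n_total

def discrimination(r1, r2, w1, w2):
    """Equation 2: P(positive | group one) minus P(positive | group two)."""
    if r1 + w1 != r2 + w2:
        raise ValueError("the two groups must be the same size")
    n = r1 + w1
    return (r1 - r2) / n

# Hypothetical item: 50 examinees per group
r1, w1 = 40, 10   # group one: 40 positive, 10 negative
r2, w2 = 15, 35   # group two: 15 positive, 35 negative

a = difficulty(r1, r2, w1, w2)
b = discrimination(r1, r2, w1, w2)
print(a, b)  # 0.55 and 0.5, i.e. 40/50 - 15/50
```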

In a very special sense the marginals of Table 1 are fixed and there is 
an interdependence between difficulty and discrimination. In this 
paper discrimination is assumed to be a function of difficulty and thus 
its magnitude for any item is tempered by the magnitude of the item's 
difficulty index. 


A frequent problem encountered by “users” of item analyses is one 
of interpreting both indices, difficulty and discrimination, 
simultaneously and making a decision about the disposition of an 
item, either retaining or rejecting it for future use. All too frequently 
interpretations are confused when either or both indices depart, even 
slightly, from .50 for difficulty and positive or negative unity for dis- 
crimination. (It should be noted here that it is a popular misconception 
that the ideal difficulty index should be .50. For a comprehensive dis- 
cussion of this point see Henrysson, 1971.) 

The major objective of this paper is one of deriving a new index that 
will facilitate the interpretation of item analyses. In this paper a new 
index, e, is presented as a function of difficulty and discrimination. 
Conceivably, the e index might convey as much information as, if not 
more than, the simultaneous interpretation of difficulty and discrimina- 
tion, while at the same time being less confusing, as it may be in- 
terpreted within a simple probability framework. 


An Algebraic Rationale for Maximum Discrimination 


Assume that a < .50, that is, that the item was difficult; then fewer than 
one-half of the individuals responded positively to the item. It is possible, 
ideally, for all positive responses to have occurred in $g_1$, and all responses 
given by $g_2$ would be negative responses, leading to the inequality 

$$r_1 + r_2 < n < w_1 + w_2.$$

Let the superscript * denote maximum values. Then if $(r_1 + r_2 < n)$ it 
may be assumed that all positive responses occurred in $g_1$. The max- 
imum discrimination of an item, b*, may now be defined algebraically 
as: 

$$b^* = \frac{r_1 + r_2}{n}. \qquad (3)$$

Inasmuch as (2n = N) the maximum discrimination of an item, 
given the difficulty (a ≤ .50) of the item, may be written as 

$$b^* = 2a; \quad (a \le .50). \qquad (4)$$


When the observed difficulty of an item is less than or equal to .50, the 
maximum discrimination of the item is just two times the difficulty. 

Assume that a > .50. Then fewer than one-half of the individuals 
responded negatively to the item. Implicitly 

$$w_1 + w_2 < n < r_1 + r_2. \qquad (5)$$

For purposes of determining a maximum discrimination index, b*, it 
may be assumed that all negative responses occurred in $g_2$. Thus, $r_1^*$ is 
assumed to be a maximum, $r_1^* = n$, and $w_1^* = 0$. The value for $r_2^*$ may 
be computed as 

$$r_2^* = (r_1 + r_2) - n. \qquad (6)$$

Given the values for $r_1^*$ and $r_2^*$, the maximum discrimination of an 
item with a difficulty greater than .50 may be defined specifically as 

$$b^* = \frac{2n - (r_1 + r_2)}{n}. \qquad (7)$$

In terms of observed difficulty 

$$b^* = 2(1 - a); \quad (a > .50). \qquad (8)$$


When the observed difficulty of an item is greater than or equal to .50 
the maximum discrimination of the item is equal to twice the propor- 
tion of negative responses. 

When the difficulty of an item is less than .50 there are fewer positive 
responses than negative responses and the maximum discrimination 
index is a function of the positive responses. Alternatively, when the 
difficulty of an item is greater than .50 there are fewer negative 
responses than positive responses and the maximum discrimination in- 
dex is a function of the negative responses. 
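
Equations 4 and 8 can be verified by exhaustive search. The sketch below, with illustrative names and a hypothetical group size, enumerates every way of dividing a fixed number of positive responses between two equal groups, takes the largest resulting b, and compares it with the closed form. Exact fractions are used so that the comparison does not depend on floating-point rounding.

```python
from fractions import Fraction

def max_discrimination_formula(a):
    """Equations 4 and 8: b* = 2a for a <= .50, otherwise 2(1 - a)."""
    return 2 * a if a <= Fraction(1, 2) else 2 * (1 - a)

def max_discrimination_bruteforce(n, total_positive):
    """Largest b = (r1 - r2)/n over all tables with group size n and r1 + r2 fixed."""
    best = None
    for r1 in range(0, n + 1):
        r2 = total_positive - r1
        if 0 <= r2 <= n:                      # r2 must be a feasible cell count
            b = Fraction(r1 - r2, n)
            if best is None or b > best:
                best = b
    return best

n = 20  # hypothetical group size, so N = 40
mismatches = [
    t for t in range(0, 2 * n + 1)
    if max_discrimination_bruteforce(n, t) != max_discrimination_formula(Fraction(t, 2 * n))
]
print(mismatches)  # an empty list means the closed form matched every case
```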

As the difficulty index of an item deviates from .50, either above or 
below it, the maximum ceiling of the discrimination index is reduced 
from unity. For each metric unit of deviation from .50 for a difficulty 
index there is a two metric unit reduction from unity for the maximum 
ceiling of the associated discrimination index. Thus given a difficulty 
index of a its absolute deviation from .50, |d|, may be used to compute 
the ceiling or maximum possible discrimination index, in absolute 
value terms |b*|, of the associated discrimination index. 

$$|d| = |.50 - a| \qquad (9)$$

and 

$$|b^*| = 1.00 - 2|d|. \qquad (10)$$


Logically the principle involved in computing maximum 
discrimination is presented by equations 9 and 10; however, the 
pragmatics of the concept are obscured by the equations. Equations 4 
and 8 represent a more reasonable set of equations for computing 
maximum discrimination. 
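
That equations 9 and 10 give the same ceiling as equations 4 and 8 can be confirmed numerically, and the ceiling can then be used to form the efficiency ratio described in the abstract (observed discrimination over maximum discrimination). The sketch below is illustrative only: the function names are hypothetical, and taking the absolute value of b in the ratio is an assumption made here so that the index stays between zero and unity.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

def ceiling_from_deviation(a):
    """Equations 9 and 10: |b*| = 1 - 2|.50 - a|."""
    d = abs(HALF - a)
    return 1 - 2 * d

def ceiling_piecewise(a):
    """Equations 4 and 8: 2a for a <= .50, otherwise 2(1 - a)."""
    return 2 * a if a <= HALF else 2 * (1 - a)

def efficiency(a, b):
    """Ratio of observed to maximum discrimination (assumes |b| is intended)."""
    ceiling = ceiling_piecewise(a)
    if ceiling == 0:
        raise ValueError("no discrimination is possible when a is 0 or 1")
    return abs(b) / ceiling

# Compare the two forms on a grid of difficulties from .00 to 1.00
grid = [Fraction(i, 100) for i in range(101)]
forms_agree = all(ceiling_from_deviation(a) == ceiling_piecewise(a) for a in grid)

# Hypothetical item with a = .55 and b = .50: ceiling is .90
e_example = efficiency(Fraction(55, 100), Fraction(1, 2))
print(forms_agree, e_example)
```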


A Cartesian Co-ordinate System Defined by 
Difficulty and Discrimination 


Geometrically, maximum discrimination has a perfect curvilinear 
relationship to difficulty within the four quadrants of a two dimen- 
sional space. Within any one quadrant maximum discrimination is 
linearly related to difficulty. Because of an isomorphism between 
quadrants, the nonadjacent quadrants are reflections of each other and 
any pair of adjacent quadrants may be used to depict the relationship 
between difficulty and maximum discrimination. 


Figure 1. General Cartesian co-ordinate system defined by difficulty and discrimination. 

In Figure 1, two adjacent quadrants of a Cartesian coordinate 
system have been depicted. The abscissa of this system represents a 
difficulty continuum while the ordinate of the system represents a 
“dual signed" discrimination continuum; the values may be in- 
terpreted as either positive or negative. 

The origin is denoted on the difficulty continuum as .50 so that any 
movement along the continuum will represent directed deviations 
from .50. The dashed line of demarcation within the “left quadrant" 
represents the line that is defined by any set of coordinates (a, b*) 
where (a ≤ .50) and b* is a maximum discrimination value defined on 
the discrimination continuum. The dashed line in the “right quadrant” 
is the line that is defined by any set of coordinates (a, b*) where (a ≥ 
.50) and b* is a maximum discrimination index, a value defined on the 
discrimination continuum. 
Figure 2. Item vector K defined by co-ordinates (a, b) and ideal item vector K* 
defined by co-ordinates (a, b*). 

In Figure 2 the terminus of an item vector, k, has been plotted with 
respect to the difficulty of the item, a, and the discrimination of the 
item, b. A second item vector, k*, has been plotted with respect to the 
difficulty of the item, a, and the maximum discrimination, b*, of the 
item. [...] is obtained. That is, the ratio of the two areas may be 
thought of as representing the efficiency of the item. The better an item 
[...] the more closely will the ratio of the two areas approach unity. 

Technically, the area of the observed triangle is a function of the 
types of discriminations the item makes. In referring to Table 1 it may 
be observed that $(r_1 + r_2)$ individuals are judged as being better than 
$(w_1 + w_2)$ other individuals. However, the item functions as though 
$[(r_1 + r_2)(w_1 + w_2)]$ dichotomous discriminations are made. The max- 
imum number of discriminations which may be made for a given item 
is $(N^2/4)$. 

Consider the component parts of the equation

\[ (r_1 + r_2)(w_1 + w_2) = r_1 w_1 + r_1 w_2 + r_2 w_1 + r_2 w_2 \tag{11} \]

then it is possible to consider the concept of "proper" and "improper"
discriminations. Let the term proper discriminations refer to those
point discriminations which are desirable in the sense that they result
through a maximizing of the frequency of one particular response
type, positive response, in one particular group, group one, while the
other response type, negative response, is being maximized in the other
group, group two. The term improper discrimination may be
associated with those point discriminations which are not desirable in
the sense that they occur as a result of undesirable response types
occurring in both groups. For any item, the number of proper
discriminations is characterized by ($r_1 w_2$) and the improper
discriminations by ($r_2 w_1$). Implicitly for easy items the negative
responses should all be accrued by group two, $w_2$, and for difficult
items the positive responses should all be accrued by group one, $r_1$.

A discrimination index in terms of point discriminations, {b}, is just
the difference between proper and improper discriminations

\[ \{b\} = (r_1 w_2) - (r_2 w_1) \tag{12} \]

and the relative discrimination index is given by

\[ b = \frac{4\{b\}}{N^2}. \tag{13} \]

The maximum discrimination index, however, assumes no improper
discriminations. Either $r_2$ or $w_1$ is assumed to be zero and either
$r_1$ or $w_2$ is assumed to be n. Thus, maximum discrimination in terms
of point discriminations, {b*}, represents the maximum number of proper
discriminations possible for a given N and difficulty index.


\[ \{b^*\} = (r_1 + r_2)\,n; \quad a \le .50 \tag{14a} \]

or

\[ \{b^*\} = (w_1 + w_2)\,n; \quad a \ge .50 \tag{14b} \]

and the relative maximum discrimination index is given by

\[ b^* = \frac{4\{b^*\}}{N^2}. \tag{15} \]


Given that the area of any triangle is equal to one-half the base
multiplied by the altitude and given that both triangles in Figure 2 have
the same base then the ratio of their areas is equivalent to the ratio of
their altitudes. Alternatively, the ratio represents the number of
observed proper discriminations less observed improper discriminations
divided by the maximum possible number of proper discriminations.
The ratio of {b} to {b*} will range from zero to unity, assuming
{b} is positive, and may be thought of, conceptually, as representing
the "purity" of the discriminations made or the efficiency of the item.
Let e represent a general efficiency index then:

\[ e = \frac{\{b\}}{\{b^*\}} \tag{16} \]

and in modified form:

\[ e = \frac{b}{b^*}. \tag{17} \]


The value for {b*} will always be positive and {b} may be positive or
negative; thus, e as defined by equations 16 and 17 may be positive or
negative. When e is negative it is negative because more improper than
proper discriminations were made. The terms proper and improper
were somewhat arbitrarily assigned to two quantities on the assumption
that more positive responses and, hence, fewer negative responses
would always be made by group one relative to group two. For
interpretations within the framework of proportions and areas, the sign
of e may be neglected. The negative sign of e becomes meaningful only
within the framework of probability.


Given the conditional magnitude of the difficulty index, the general
e may be further specified as:

\[ e = \frac{b}{2a}; \quad a \le .50 \tag{18} \]

general efficiency for items having difficulty indices less than or equal
to .50, and

\[ e = \frac{b}{2(1 - a)}; \quad a \ge .50 \tag{19} \]

general efficiency for items having difficulty indices greater than or
equal to .50.

Certain initial observations may be made with respect to e and 

proportion interpretations. 

(a) If the observed discrimination index of an item is zero then the 
efficiency of the item is zero. 

(b) For any level of difficulty, excluding zero and unity, it is 
theoretically possible for e to range from zero to unity assuming 
a positive discrimination index. 

(c) Efficiency is the ratio of observed proper discriminations less 
improper discriminations to the maximum possible number of 
proper discriminations for a given difficulty level and group size. 

(d) The general index e is indicative of how well an item has
functioned relative to how well it might have functioned for a given
N and specific difficulty level.
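
The indices defined in equations 12 through 19 can be collected in a short computational sketch. The Python below is illustrative only: the function name `item_indices` and its argument names are assumptions introduced here, and it presumes the layout of Table 1 (two groups of equal size n, with r1 and r2 positive responses in groups one and two).

```python
from fractions import Fraction


def item_indices(r1, r2, n):
    """Return (a, b, b_star, e) for one item.

    Hypothetical helper, not taken from the paper.
    r1, r2: positive responses in group one and group two
    n:      size of each group, so that N = 2n
    """
    w1, w2 = n - r1, n - r2              # negative responses in each group
    N = 2 * n
    a = Fraction(r1 + r2, N)             # difficulty: proportion of positive responses
    point_b = r1 * w2 - r2 * w1          # equation 12: proper less improper discriminations
    b = Fraction(4 * point_b, N * N)     # equation 13: relative discrimination
    half = Fraction(1, 2)
    b_star = 2 * a if a <= half else 2 * (1 - a)   # equations 14-15 in relative form
    e = b / b_star if b_star else None   # equations 16-19; undefined when a is 0 or 1
    return a, b, b_star, e


a, b, b_star, e = item_indices(r1=6, r2=2, n=10)
print(a, b, b_star, e)   # 2/5 2/5 4/5 1/2
```

Exact fractions are used so that the identities between the point form and the relative form of each index can be checked without rounding error.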


Probability Interpretations of Efficiency 


In the previous section it was noted that the general efficiency index
could be subdivided into two indices, one for items having difficulties
less than or equal to .50, henceforth efficiency of the first kind, $e_1$, and
one for items having difficulties greater than or equal to .50,
henceforth efficiency of the second kind, $e_2$. The indices of efficiency
may be further utilized to make probability interpretations with
respect to positive responses and with respect to negative responses.

Equation 18 defining $e_1$, for items having difficulties less than or
equal to .50, may be modified to define a computational equation for
$e_1$ regardless of item difficulty and sign of the discrimination index.

\[ e_1 = \frac{r_1 - r_2}{r_1 + r_2} \tag{20} \]

Similarly, equation 19 defining $e_2$, for items having difficulties greater
than or equal to .50, may be modified to define a computational
equation for $e_2$ regardless of item difficulty and sign of the discrimination
index.

\[ e_2 = \frac{w_2 - w_1}{w_1 + w_2} \tag{21} \]

Utilizing equations 20 and 21, it is possible to discuss conditional
probabilities and note that, quite unlike traditional U-L discrimination
indices, efficiency considers two events which are mutually exclusive
and exhaustive with respect to a given sample space.

Assume that a positive response has been made to an item. Given an
individual making a positive response, the probability that the
individual is a member of $g_1$ is given by $P(g_1 \mid r)$ while the probability
that the individual is a member of $g_2$ is given by $P(g_2 \mid r)$, where:

\[ P(g_1 \mid r) = \frac{r_1}{r_1 + r_2} \tag{22} \]

and

\[ P(g_2 \mid r) = \frac{r_2}{r_1 + r_2}. \tag{23} \]

Then

\[ e_1 = P(g_1 \mid r) - P(g_2 \mid r) \tag{24} \]

efficiency of the first kind is the difference between two conditional
probabilities, where the probabilities are for group membership given
a positive response.

Assume that a negative response has been made to an item. Given
an individual making a negative response to an item, the probability
that the individual is a member of $g_1$ is given by $P(g_1 \mid w)$ while the
probability that the individual is a member of $g_2$ is given by $P(g_2 \mid w)$,
where:

\[ P(g_1 \mid w) = \frac{w_1}{w_1 + w_2} \tag{25} \]

and

\[ P(g_2 \mid w) = \frac{w_2}{w_1 + w_2}. \tag{26} \]

Then

\[ e_2 = P(g_2 \mid w) - P(g_1 \mid w) \tag{27} \]

efficiency of the second kind is the difference between two conditional
probabilities, the probability of group membership given a negative
response.

Efficiency of the first kind and efficiency of the second kind are both
mutually exclusive and exhaustive with respect to sample space:

\[ e_1 = P(g_1 \mid r) - P(g_2 \mid r); \tag{28} \]
\[ 1.0 = P(g_1 \mid r) + P(g_2 \mid r); \tag{29} \]
\[ e_2 = P(g_2 \mid w) - P(g_1 \mid w); \tag{30} \]
\[ 1.0 = P(g_1 \mid w) + P(g_2 \mid w). \tag{31} \]

It may also be noted that $e_1$ and $e_2$ are proportional to each other.
The ratio of $e_1$ to $e_2$ represents the odds in favor of a negative response
while the ratio of $e_2$ to $e_1$ represents the odds in favor of a positive
response.
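
As a numerical check on equations 20 through 31, the sketch below computes both kinds of efficiency as differences of conditional probabilities. The function name `efficiencies` is an assumption made for this example, and the code presumes two groups of equal size n with at least one positive and one negative response overall.

```python
from fractions import Fraction


def efficiencies(r1, r2, n):
    """Return (e1, e2) from the conditional probabilities of group membership.

    Hypothetical helper; assumes 0 < r1 + r2 < 2n so both ratios are defined.
    """
    w1, w2 = n - r1, n - r2
    p_g1_r = Fraction(r1, r1 + r2)   # equation 22: P(g1 | positive response)
    p_g2_r = Fraction(r2, r1 + r2)   # equation 23: P(g2 | positive response)
    p_g1_w = Fraction(w1, w1 + w2)   # equation 25: P(g1 | negative response)
    p_g2_w = Fraction(w2, w1 + w2)   # equation 26: P(g2 | negative response)
    e1 = p_g1_r - p_g2_r             # equation 24
    e2 = p_g2_w - p_g1_w             # equation 27
    return e1, e2


e1, e2 = efficiencies(r1=6, r2=2, n=10)
print(e1, e2, e1 / e2)   # 1/2 1/3 3/2
```

In this example 12 of the 20 responses are negative and 8 are positive, so the ratio of $e_1$ to $e_2$, 3/2, matches the odds in favor of a negative response.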


The Probability of Obtaining an Observed e by Chance 


The general e index has been discussed within the framework of $e_1$ and
$e_2$. It was noted that for any given level of difficulty, e may range from
zero to unity. Quite logically one would like to know the probability of
obtaining an observed e for any given index of difficulty. Generically,
what is a significant e?

The model contingency table from which e is computed is unique 
within the framework of statistics. Theoretically, e is a measure of
departure from independence in the contingency table. However, there 
is a different probability distribution for e associated with each 
uniquely different sample size, N, and each uniquely different difficulty 
level, a. In order to compute the probabilities associated with any 
given e it must be assumed that all four marginals of the contingency 
table are fixed. That is to say, the probabilities reported for any e are 
determined from the specific probability distribution of e associated 
with a given N and a. 

Technically, in order to test the null hypothesis of independence,
(e = 0), it is necessary to compute the probability of obtaining the
observed e and all possible e indices of a larger magnitude assuming
constant marginals. Although Pearson's (1932) chi square test might
be used with this model of a contingency table, it was not designed
specifically for such use and in using it one would have to constantly
keep in mind the consequences of its use with small sample sizes and
also meet the assumptions of expected frequencies greater than five in
the cells of the table.

A test designed specifically for the type of model contingency table
associated with e is Fisher's (1935) exact test. Essentially, Fisher's test
would indicate the exact probability of obtaining an observed e given a
particular N and a. Furthermore, it could be used to compute the exact
probabilities of each associated e greater than the observed e. In
summing up all of these exact probabilities, one would have the
probability of obtaining an e as large or larger than the one obtained
for the given level of difficulty, a, and group size, N. Unfortunately, for
any test having more than four or five items or fifteen or sixteen
individuals, such an approach would be extremely time consuming.

However, it is possible to use a variation of Fisher's test, which is
based upon the hypergeometric distribution, to establish the
magnitudes of the e indices which would represent the extreme
percentage of such indices for various difficulty levels and group sizes. Thus,
it is possible, so to speak, to establish "tables of significance" for the e
index by computing probabilities associated with extreme values for e
and working, computationally, toward the less extreme values.
Assuming Table 1 as a model, the total number of ways in which the
table can be obtained, while maintaining fixed marginals, is given by

\[ \frac{(N!)^2}{(r_1 + r_2)!\,(w_1 + w_2)!\,n!\,n!} \tag{32} \]

which represents the product of the number of ways of taking ($r_1 + r_2$)
responses from N, multiplied by the number of combinations of n
individuals taken from the total N individuals. There are
[$N!/(r_1!\,r_2!\,w_1!\,w_2!)$] ways of obtaining the cell frequencies in Table 1. Thus, the exact
probability of obtaining the observed table frequencies and hence e
may be computed as the ratio of the number of ways of obtaining the
cell frequencies to the number of ways of taking ($r_1 + r_2$) from N, and
then multiplied by the number of combinations of n individuals taken
from the total N individuals which is:

\[ P(e) = \frac{(r_1 + r_2)!\,(w_1 + w_2)!\,n!\,n!}{r_1!\,r_2!\,w_1!\,w_2!\,N!}. \tag{33} \]

In equation 33 it is important to note that the probability of either $e_1$
or $e_2$ is the same for any table inasmuch as the equation is associated
with all four cells.
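
Equation 33 can be evaluated directly with exact integer arithmetic. The sketch below is a minimal illustration, with hypothetical function names, that computes the factorial form and compares it against the equivalent hypergeometric form written with binomial coefficients; both assume the equal-group layout of Table 1.

```python
from fractions import Fraction
from math import comb, factorial


def table_probability(r1, r2, n):
    """Exact probability of one table with fixed marginals (equation 33)."""
    w1, w2 = n - r1, n - r2
    N = 2 * n
    numerator = factorial(r1 + r2) * factorial(w1 + w2) * factorial(n) ** 2
    denominator = (factorial(r1) * factorial(r2) * factorial(w1)
                   * factorial(w2) * factorial(N))
    return Fraction(numerator, denominator)


def hypergeometric_probability(r1, r2, n):
    """The same probability written as a hypergeometric term."""
    return Fraction(comb(n, r1) * comb(n, r2), comb(2 * n, r1 + r2))


print(table_probability(3, 0, 3))            # 1/20
print(hypergeometric_probability(3, 0, 3))   # 1/20
```

The binomial form avoids the very large factorials of equation 33 and may be more convenient for larger N.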

Assume that the difficulty of an item is less than .50, then it is
possible to develop from equation 33 a computational algorithm for
determining the significance of e-indices. Only the cell values $r_1$, $r_2$, $w_1$
and $w_2$ will change; thus, define as x that aspect of equation 33 that
remains constant.

\[ x = \frac{(r_1 + r_2)!\,(w_1 + w_2)!\,n!\,n!}{N!} \tag{34} \]

Let $y_1$ represent the denominator of 33, excluding N!, that would be
associated with the most extreme contingency table for the given N and
a. The value of e associated with the most extreme contingency table
will be unity. The probability of this value for the most extreme
table may be determined from x and $y_1$. The equation for $y_1$ is given by

\[ y_1 = (r_1 + r_2)!\,(0)!\,[n - (r_1 + r_2)]!\,(n)!. \tag{35} \]

The exact probability of a table occurring with an efficiency index of
unity given the particular N and a is $x/y_1$. (For continuity it is assumed
that $e = e_1$; however, equations 35-40 may be used for $e = e_2$
by simply substituting $w_1$ for $r_1$ and $w_2$ for $r_2$.) Compute the y-value for
the subsequently more independent tables following the
computational procedure

\[
\begin{aligned}
y_2 &= (r_1 + r_2 - 1)!\,(1)!\,[n - (r_1 + r_2 - 1)]!\,(n - 1)! \\
y_3 &= (r_1 + r_2 - 2)!\,(2)!\,[n - (r_1 + r_2 - 2)]!\,(n - 2)! \\
y_i &= [r_1 + r_2 - (i - 1)]!\,(i - 1)!\,[n - (r_1 + r_2 - (i - 1))]!\,[n - (i - 1)]!.
\end{aligned}
\tag{36}
\]

In a special sense, when using Fisher's (1935) exact test, one first
computes $y_i$ and then $y_{i-1}$ and so on to $y_1$. Essentially the exact
probability, $\alpha$, of obtaining the cells of an observed contingency table or
some worse departure from independence may be computed by

\[ \alpha = \sum_{k=1}^{i} \frac{x}{y_k}. \tag{37} \]
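
The computational procedure of equations 34 through 37 can be sketched as follows. This is an illustrative reading rather than the author's program: the function name `tail_probability` is hypothetical, and the code assumes $r_1 \ge r_2$ and $r_1 + r_2 \le n$ (difficulty not exceeding .50), so that the most extreme table has $r_2 = 0$.

```python
from fractions import Fraction
from math import factorial


def tail_probability(r1, r2, n):
    """One-tailed exact probability of the observed table or a more extreme one.

    Assumes r1 >= r2 and r1 + r2 <= n (hypothetical helper).
    """
    R = r1 + r2                      # total positive responses
    W = 2 * n - R                    # total negative responses
    # Equation 34: the part of equation 33 that stays constant across tables.
    x = Fraction(factorial(R) * factorial(W) * factorial(n) ** 2,
                 factorial(2 * n))
    alpha = Fraction(0)
    # Table i has (i - 1) positive responses in group two; the observed
    # table is therefore number r2 + 1.
    for i in range(1, r2 + 2):
        k = i - 1
        y_i = (factorial(R - k) * factorial(k)
               * factorial(n - (R - k)) * factorial(n - k))   # equation 36
        alpha += x / y_i             # equation 37: accumulate x / y_i
    return alpha


print(float(tail_probability(6, 2, 10)))
```

For $r_1 = 6$, $r_2 = 2$ and $n = 10$ the loop visits three tables, with $r_2$ equal to 0, 1 and 2 in turn.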


Implicit in Fisher's test is the assumption of finite sample size. To
establish an $\alpha$-level and then attempt to determine the associated
e-value is tantamount to assuming an infinite sample size. The
assumption of finite sample size is a restriction that necessarily exists because
equations 33, 34, 35, and 36 use discrete numbers. Thus, for any
sample of "modest" size, say N = 60, one is faced with the task of
computing extremely large factorials.

When two of the fixed marginals are identical, as they are in Table 1,
the computation of probability levels and associated e-indices is greatly
simplified. There are two rather compelling properties associated with
such a contingency table. First, the hypergeometric probability
distribution associated with such a table will be symmetric. Secondly, the
most extreme departure from independence is immediately known: at
least one of the four cells will be zero and the efficiency index will be
unity.

Within the framework of this paper one may think of Fisher's exact
probability test as determining the exact probability of obtaining an
e-index as large or larger than the one observed. However, through an
inexact procedure it is possible to establish a critical probability, say $\alpha$,
and then compute the discrete e-index whose exact probability, $\alpha'$, is
the best estimate of $\alpha$ given the restriction that ($\alpha' \le \alpha$), the observed
probability level may not be larger than $\alpha$. An approach such as this
would indirectly facilitate the use of much larger, greater than 60, but
still finite sample sizes.

Assume ($\alpha$); then compute x by equation 34, and $y_1$ by equation 35. If
$x/y_1 < \alpha$ then compute $y_2$ and so on following equation 36 until for the jth
and (j + 1)th values for y the following inequality occurs.

\[ \sum_{i=1}^{j} \frac{x}{y_i} \le \alpha < \sum_{i=1}^{j+1} \frac{x}{y_i} \tag{38} \]

The above inequality implies that the probability of the e-index
associated with the cells of the jth contingency table is less than or equal
to $\alpha$ while the probability of the e-index associated with the cells of the
(j + 1)th contingency table is greater than $\alpha$. The following two
equations may be used to compute $\alpha'$ and $e'$ respectively:

\[ \alpha' = \sum_{i=1}^{j} \frac{x}{y_i} \tag{39} \]

\[ e' = 1 - \frac{2(j - 1)}{r_1 + r_2} \tag{40} \]
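
A critical value $e'$ for a chosen $\alpha$ can be found by accumulating the terms $x/y_i$ until the next table would push the sum past $\alpha$, as in equations 38 through 40. The sketch below is one possible reading of that procedure; the name `critical_efficiency` and its arguments (`R` for $r_1 + r_2$, `n` for the group size) are assumptions, and it presumes $0 < R \le n$.

```python
from fractions import Fraction
from math import factorial


def critical_efficiency(R, n, alpha):
    """Return (alpha_prime, e_prime), or None if even e = 1 exceeds alpha.

    R: total positive responses (r1 + r2), assumed 0 < R <= n
    n: size of each group
    """
    W = 2 * n - R
    x = Fraction(factorial(R) * factorial(W) * factorial(n) ** 2,
                 factorial(2 * n))                      # equation 34
    cumulative = Fraction(0)
    best = None
    # Walk from the most extreme table (j = 1) toward independence,
    # stopping once group two would hold more positives than group one.
    for j in range(1, R // 2 + 2):
        k = j - 1
        y_j = (factorial(R - k) * factorial(k)
               * factorial(n - (R - k)) * factorial(n - k))   # equation 36
        cumulative += x / y_j
        if cumulative > alpha:       # right-hand side of inequality 38
            break
        best = (cumulative, 1 - Fraction(2 * k, R))     # equations 39 and 40
    return best


result = critical_efficiency(R=8, n=10, alpha=Fraction(1, 20))
print(result[1])   # 3/4
```

With eight positive responses and ten individuals per group, an observed $e_1$ of at least .75 would be needed for one-tailed significance at the .05 level under this sketch.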


[...] necessary to substitute $w_1$ for $r_1$ and $w_2$ for $r_2$ in equations 35-40.
Alternatively, one tail of the distribution is associated with $e_1$ while the
other tail is associated with $e_2$. One may consider just $e_1$ and use the
[...] $e_1$ to its associated $e_2$ value and use [...]
probabilities are computed inasmuch as [...]. (A table of .05
[...] levels of difficulty for samples ranging in
size up to 100 is available from the author.)


Total Test Efficiency 


Just as one can talk of total test difficulty, so, also,
can one talk of total test efficiency. In this section, the equations for
computing total test efficiency will be discussed. No attempt will be
made to interpret the total test efficiency index other than the cursory
definition that follows from it computationally.

Assume some test composed of j items. Then $N_j$ individuals will
respond to item j. The number of individuals in either $g_1$ or $g_2$ will be
denoted as $n_j$ for the jth item and ($2n_j = N_j$). The difficulty and
discrimination indices for the jth item may be denoted as $a_j$ and $b_j$
respectively. The total number of possible discriminations that may be made
by the jth item is $N_j^2/4$. The absolute maximum frequency of proper
discriminations that is possible for the jth item, $\{b_j^*\}$, is given as:

\[ \{b_j^*\} = 2 n_j^2 a_j. \tag{41} \]

The frequency of proper discriminations less improper
discriminations, $\{b_j\}$, for the jth item is given as:

\[ \{b_j\} = n_j^2 b_j. \tag{42} \]

Inasmuch as efficiency is defined, at an item level, as the ratio of
observed proper discriminations less improper discriminations to the
maximum possible number of proper discriminations, for a given
difficulty index and group size, assume a similar definition for total test
efficiency. Let total test efficiency be represented by the ratio of total
observed proper discriminations less improper discriminations to the
maximum possible number of proper discriminations for the total test.
Let $E_1$ represent total test efficiency, assuming that the difficulty of all
items is less than .50, then

\[ E_1 = \frac{\sum_{j} n_j^2 b_j}{2 \sum_{j} n_j^2 a_j}. \tag{43} \]

Such an index is indicative of the proportion of "quality"
discriminations made by a total test given that all items have a difficulty
less than .50. For items having a difficulty index greater than .50
equation 43 is modified to determine $E_2$ as

\[ E_2 = \frac{\sum_{j} n_j^2 b_j}{2 \sum_{j} n_j^2 (1 - a_j)}. \tag{44} \]

If the j items form a mastery test, then either equation 43 or 44 could
be used to compute total test efficiency. If the j items form an
achievement test, assume that s of the items have difficulty indices less than .50
and assume that k of the items have difficulty indices greater than or
equal to .50, then if the items are grouped dichotomously according to
those items having difficulty indices less than .50 and those equal to or
greater than .50 the general efficiency index, E, of the achievement test
is given by:

\[ E = \frac{\sum_{j=1}^{s} n_j^2 b_j + \sum_{j=1}^{k} n_j^2 b_j}{2\left[\sum_{j=1}^{s} n_j^2 a_j + \sum_{j=1}^{k} n_j^2 (1 - a_j)\right]}. \tag{45} \]


Experience with this index is still growing and, therefore, its
application to total tests is only of theoretical interest at this time. However,
the systematic study of E, $E_1$ and $E_2$ might be informative. To wit,
either $E_1$ or $E_2$ might serve as indicators of discriminatory or difficulty
homogeneity or perhaps as some sort of index of internal consistency
for a mastery test. The general index, E, might serve as an index of
internal consistency for an achievement test.
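
A total test efficiency index of this kind might be computed as in the sketch below. It rests on an assumption that should be stated plainly: each item is weighted by $n_j^2$, which follows from the item-level identity $\{b\} = n^2 b$ but could not be verified against the printed equations. The function name `total_test_efficiency` and the tuple layout of its input are hypothetical.

```python
from fractions import Fraction


def total_test_efficiency(items):
    """Ratio of summed observed to summed maximum point discriminations.

    items: iterable of (n_j, a_j, b_j) with a_j and b_j given as Fractions.
    Assumes n_j**2 weighting, i.e. {b_j} = n_j**2 * b_j (an assumption).
    """
    half = Fraction(1, 2)
    observed = Fraction(0)
    maximum = Fraction(0)
    for n_j, a_j, b_j in items:
        observed += n_j ** 2 * b_j
        # Maximum proper discriminations: 2 n^2 a for items with a < .50,
        # 2 n^2 (1 - a) for items with a >= .50.
        maximum += 2 * n_j ** 2 * (a_j if a_j < half else 1 - a_j)
    return observed / maximum


items = [
    (10, Fraction(2, 5), Fraction(2, 5)),   # item with a = .40, b = .40
    (10, Fraction(3, 5), Fraction(2, 5)),   # item with a = .60, b = .40
]
print(total_test_efficiency(items))   # 1/2
```

When every item has the same group size the weights cancel and the index reduces to the sum of the $b_j$ divided by the sum of the $b_j^*$.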


A Pragmatic Scheme for the Use of e 


It is possible to set up subjective criteria for determining easy, 
moderate, and hard items as well as nonefficient, efficient, and ideally 
efficient items. Let the following inequalities, based upon efficiency, e, 
and difficulty, a, serve as operational definitions of the above terms. 


0.00 ≤ e ≤ 0.50    nonefficient item
0.50 ≤ e ≤ 0.80    efficient item
0.80 ≤ e ≤ 1.00    ideally efficient item

0.75 ≤ a ≤ 1.00    easy item
0.25 ≤ a ≤ 0.75    moderate item
0.00 ≤ a ≤ 0.25    hard item


Because exact intervals have not been made these definitions are,
theoretically, not mutually exclusive but for practical purposes they
may be thought of as mutually exclusive and exhaustive with respect to
difficulty and efficiency. Utilizing these definitions, one may
categorize all items of an instrument into one of nine subjective item types,
e.g., nonefficient easy items.

Based upon this categorization it is possible to construct within the
framework of Cartesian coordinates a chart for determining item
quality. In Figure 3 such a chart has been constructed. Given the
difficulty and discrimination of an item as coordinates, it is possible to
locate the associated item point. If the point falls within the working
area, then the item is a working item. Additionally, it is possible to
determine the relative difficulty of the working item. Although this
chart was constructed as a function of the e-index, its use actually
precludes the computation of such an index.

Figure 3. Iconic representation of contingency categorization of subjective item
types. (Abscissa: relative difficulty from 0 to 1.00, divided into hard, moderate, and
easy regions at .25 and .75; ordinate: relative discrimination.)

Assuming Figure 3 to be a triangular probability distribution, it is
possible to briefly discuss the chance probability of obtaining different
item types. Note in Figure 3 that there are nine different item types.
The probability of any item type occurring by chance may be
determined as the ratio of the surface area associated with each item type to
the total surface area of the probability distribution. The probabilities
for the nine subjective item types are reported in Table 2.

As previously noted, the concept of item types is subjective as are
the arbitrary numerical criteria for defining them. It is possible to
construct a more detailed chart than Figure 3, as well as a more detailed
table of chance probabilities to go with it.

TABLE 2
Chance Probabilities of Subjective Item Types

Difficulty                                Ideally
Levels       Nonefficient   Efficient    Efficient     Total
Easy            .0625         .0375        .0250       .1250
Moderate        .3750         .2250        .1500       .7500
Hard            .0625         .0375        .0250       .1250
Total           .5000         .3000        .2000      1.0000
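
The entries of Table 2 can be reproduced from the area argument alone. The sketch below assumes, as the text does, that the chart is a triangle bounded by the maximum discrimination lines ($b^* = 2a$ up to a = .50 and $b^* = 2(1 - a)$ beyond it), and that within any difficulty level the efficiency bands occupy the fractions .50, .30 and .20 of the available height. All names are hypothetical.

```python
from fractions import Fraction as F

DIFFICULTY_BANDS = {
    "hard": (F(0), F(1, 4)),
    "moderate": (F(1, 4), F(3, 4)),
    "easy": (F(3, 4), F(1)),
}
EFFICIENCY_BANDS = {
    "nonefficient": (F(0), F(1, 2)),
    "efficient": (F(1, 2), F(4, 5)),
    "ideally efficient": (F(4, 5), F(1)),
}


def area_under_b_star(a):
    """Area under the maximum-discrimination line from difficulty 0 to a."""
    if a <= F(1, 2):
        return a * a                      # integral of 2t from 0 to a
    return F(1, 2) - (1 - a) ** 2         # whole triangle less the right-hand tip


def chance_probability(difficulty, efficiency):
    """Share of the triangle's area occupied by one subjective item type."""
    lo, hi = DIFFICULTY_BANDS[difficulty]
    e_lo, e_hi = EFFICIENCY_BANDS[efficiency]
    band_area = area_under_b_star(hi) - area_under_b_star(lo)
    total_area = area_under_b_star(F(1))
    return (band_area / total_area) * (e_hi - e_lo)


for level in ("easy", "moderate", "hard"):
    print(level, [float(chance_probability(level, e)) for e in EFFICIENCY_BANDS])
```

Changing the cut points in the two dictionaries gives the more detailed table of chance probabilities that the text mentions.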


The e-index and Certain other Indices 


If certain assumptions pertaining to substantial sample size are met,
the e-index may be converted to a Pearson $\chi^2$ statistic. The problem of
sample size is closely related to the problem of small sample size
expectancies in the 2 × 2 table. Failure to meet Pearson's (1932, 1916)
assumption of sample size will result in a statistic following some
hypergeometric distribution rather than a chi-square distribution. Consider
the marginals of Table 1: one set of marginals is always fixed, within a
proportional framework, at n/N. The other set of marginals, within a
proportional framework, varies between zero and unity defining the
item difficulty and its complement. Let five be the lower bound for
expected cell frequency, then 5/N would represent the expected cell
proportion. Assuming independence an expected cell proportion may
be computed as the product of the two marginals associated with the
cell. Since n/N is one-half, the smallest expected cell proportion is a/2,
and the minimum difficulty of an item for a valid use of a Pearson chi
square test is then given by

\[ a \ge \frac{10}{N}. \tag{46} \]

Assuming an item has a difficulty greater than or equal to 10/N and
less than or equal to (1 − 10/N) a Pearson chi-square statistic, actually
a test of homogeneity, may be applied to the item's contingency table.
Specifically, the conditional inequality

\[ \frac{10}{N} \le a \le \left(1 - \frac{10}{N}\right) \tag{47} \]

must be met, assuming N is sizable. Assuming these conditions have
been met then $\chi^2$, the Pearson chi square statistic, may be computed
directly from either $e_1$ or $e_2$, with 1 degree of freedom.

\[ \chi^2 = N e_1^2 \left(\frac{a}{1 - a}\right) = N e_2^2 \left(\frac{1 - a}{a}\right) \tag{48} \]


A second index which is closely related to efficiency, at least
algebraically, is the phi index. More specifically, the ratio of the phi
index to its maximum possible value, phi-max, as defined by Cureton
(1959), is algebraically identical to the efficiency index.

Given the equation defining phi as a function of $\chi^2$:

\[ \phi = \left(\frac{\chi^2}{N}\right)^{1/2} \tag{49} \]

it is possible by simple substitution in equation 48 to define phi as a
function of efficiency.

\[ \phi = e_1 \left(\frac{a}{1 - a}\right)^{1/2} = e_2 \left(\frac{1 - a}{a}\right)^{1/2} \tag{50} \]

Following Cureton (1959), it is also possible to define the maximum
value for $\phi$ from Table 1 within the context of difficulty. Just as a
distinction between $e_1$ and $e_2$ occurred as a function of the difficulty being
greater than or less than .50 so also must this distinction be made for
the maximum phi coefficient.

\[ \phi_{max} = \left(\frac{a}{1 - a}\right)^{1/2}; \quad a \le .50; \tag{51a} \]

\[ \phi_{max} = \left(\frac{1 - a}{a}\right)^{1/2}; \quad a \ge .50. \tag{51b} \]


In forming the ratio of phi to its maximum value, two equations
result as a function of difficulty. The first equation results as a function
of (a ≤ .50) and is

\[ \frac{\phi}{\phi_{max}} = e_1 \tag{52} \]

just a redefinition of $e_1$ in terms of phi and its maximum. The second
ratio, being a function of (a ≥ .50), is just

\[ \frac{\phi}{\phi_{max}} = e_2 \tag{53} \]

a redefinition of $e_2$ in terms of phi and its maximum.

It is important to note here that the computation of $\phi_{max}$ must
follow 51a if the item difficulty is less than or equal to .50 and the
computation of $\phi_{max}$ must follow 51b if the item difficulty is greater than or
equal to .50.
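
The algebraic links among efficiency, chi square and phi can be checked numerically. The sketch below, with hypothetical function names, computes chi square from $e_1$ and from $e_2$ as in equation 48, compares both with the familiar 2 × 2 formula, and then forms the ratio of phi to phi-max following equations 49 and 51. It assumes a positive discrimination index and a difficulty strictly between zero and unity.

```python
from fractions import Fraction
from math import sqrt


def chi_square_forms(r1, r2, n):
    """Return chi square via e1, via e2, and via the usual 2 x 2 formula."""
    w1, w2 = n - r1, n - r2
    N = 2 * n
    a = Fraction(r1 + r2, N)
    e1 = Fraction(r1 - r2, r1 + r2)
    e2 = Fraction(w2 - w1, w1 + w2)
    via_e1 = N * e1 ** 2 * a / (1 - a)          # equation 48, first form
    via_e2 = N * e2 ** 2 * (1 - a) / a          # equation 48, second form
    usual = Fraction(N * (r1 * w2 - r2 * w1) ** 2,
                     (r1 + r2) * (w1 + w2) * n * n)
    return via_e1, via_e2, usual


def phi_over_phi_max(r1, r2, n):
    """Ratio of phi to its maximum (equations 49 and 51a/51b)."""
    N = 2 * n
    a = Fraction(r1 + r2, N)
    chi2 = chi_square_forms(r1, r2, n)[0]
    phi = sqrt(chi2 / N)                         # equation 49
    if a <= Fraction(1, 2):
        phi_max = sqrt(a / (1 - a))              # equation 51a
    else:
        phi_max = sqrt((1 - a) / a)              # equation 51b
    return phi / phi_max


print(chi_square_forms(30, 15, 50))
print(phi_over_phi_max(30, 15, 50), phi_over_phi_max(40, 30, 50))
```

In the first example the difficulty is .45 and the ratio agrees with $e_1$; in the second the difficulty is .70 and the ratio agrees with $e_2$.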
Although there is an apparent close algebraic relationship between
efficiency and the ratio ($\phi/\phi_{max}$), the concept of general efficiency, the
use of both $e_1$ and $e_2$, would appear to be more meaningful than the
ratio ($\phi/\phi_{max}$). We also find ourselves hard pressed to interpret general
efficiency within the framework of a Pearsonian correlation. Finally,
based upon Carroll's (1961) discussion of ($\phi/\phi_{max}$) it would seem most
prudent to caution against forcing efficiency into the framework of the
phi coefficient.


Conclusion 


This paper has presented the basic aspects of the efficiency index. 
Whether or not the index will replace or supplement the traditional 
discrimination index remains to be seen. However, it would seem that 
the general efficiency index has many more statistically compelling 
properties than the traditional discrimination index. 
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FACTOR STRUCTURE OF THE McCARTHY SCALES 
AT FIVE AGE LEVELS BETWEEN 2½ AND 8½


ALAN S. KAUFMAN
The Psychological Corporation


The McCarthy Scales of Children's Abilities (MSCA) were factor
analyzed at five age levels: 2½, 3-3½, 4-4½, 5-5½, and 6½-7½-8½.
The standardization sample (N = 1032) provided the source of data.
Varimax rotated factors akin to four of the six MSCA Scales
(General Cognitive, Verbal, Memory, and Motor) appeared at age
2½, and tended to appear at all older age levels. Factors akin to the
Perceptual-Performance and Quantitative Scales emerged at ages
3-3½ and 5-5½, respectively. The overall findings were interpreted
from a developmental perspective, and the data were shown to offer
evidence for the construct validity of the MSCA.


THE McCarthy Scales of Children’s Abilities (MSCA) comprise 18 
short mental and motor tests which have been grouped into six scales: 
Verbal, Perceptual-Performance, Quantitative, General Cognitive, 
Memory, and Motor. McCarthy (1972) selected the scales primarily on 
the basis of functional and intuitive considerations, although the 
results of preliminary factor analyses of parts of the MSCA standard- 
ization data were also considered. These analyses of the standardiza- 
tion edition of the MSCA, though exploratory in nature, served a 
number of other useful functions. They suggested that the battery (1)
has a strong underlying structure, as evidenced by the consistency of
the factor patterns for each set of data when several different
techniques of factor analysis were applied; (2) has a somewhat similar
structure at three different age levels; and (3) measures some abilities
which are similar to the abilities assessed by conventional intelligence
tests, and others which seem to add uniqueness to the field of
children's testing (Kaufman and Hollenbeck, 1973).

The purpose of the present study was to analyze the final version of 
the MSCA (which is somewhat shorter than the standardization 
edition) to provide a more definitive picture of the MSCA’s factor 
structure. The availability of data for the entire standardization sam- 
ple made it possible to have a relatively large ratio of the number of 
subjects to the number of variables in each of the present analyses. 
Such ratios help insure the stability of the resulting factor patterns, 
and therefore permit more meaningful comparisons of the factor 
structure at different age levels. 

In addition to gaining this developmental perspective of the MSCA, 
and relating the results to existing theory and research, a second major 
purpose of the study was to evaluate the construct validity of the Mc- 
Carthy Scales; as Anastasi (1968, pp. 114-120) indicates, factor 
analysis is one of the acceptable techniques for providing evidence of a 
test's construct validity. Since McCarthy's (1972, p. 2) main goal in
structuring the scales was to develop a clinically useful instrument
(rather than a factorially pure test), the construct validity of the 
MSCA certainly does not rest on data obtained from factor analysis. 
Nevertheless, a close correspondence between the factor structure and 
the chosen scales will enhance the instrument's validity, and should 
make the scores more meaningful to the clinician. 


Method 


Instrument 


As indicated, the McCarthy Scales include a number of tests which 
have been grouped into six scales. Figure 1 provides a schematic 
illustration of the content of each scale and the interrelationship that 
exists among the scales. 
The component tests are described in detail in the manual (McCarthy,
1972), and are merely summarized here: (1) Block Building: copying
structures made out of cubes, (2) Puzzle Solving: putting together
cut-up pictures, (3) Pictorial Memory: recalling pictures [...],
(4) Word Knowledge: Picture Vocabulary (Part I) and Oral Vocabulary
(Part II), (5) Number Questions: number facts and [...] problems,
(6) Tapping Sequence: repeating sequences tapped on [...],
(7) Verbal Memory: repeating words and sentences (Part I) and
retelling a story (Part II), (8) Right-Left Orientation: knowing right
vs. left, (9) Leg Coordination: motor skills such as walking backwards,
(10) Arm Coordination: bouncing a ball (Part I), catching a beanbag
(Part II), and throwing a beanbag at a target (Part III),
(11) Imitative Action: simple motor skills such as clasping hands,
(12) Draw-A-Design: copying designs, (13) Draw-A-Child: drawing
a child of the same sex, (14) Numerical Memory: repeating digits
forwards (Part I) and backwards (Part II), (15) Verbal Fluency:
naming as many "things" in several categories as possible in 20
seconds, (16) Counting and Sorting: simple number concepts,
(17) Opposite Analogies: providing opposites to complete analogies,
(18) Conceptual Grouping: logical classification.

Figure 1. The grouping of the 18 MSCA tests into six scales. The tests in the Verbal
(V), Perceptual-Performance (P), and Quantitative (Q) Scales are combined to form the
General Cognitive Scale. Each of the Memory tests is also included on either the V, P, or
Q Scale, and hence on the General Cognitive Scale. The Motor Scale contains three
noncognitive tests which belong exclusively to it, and two tests which are shared with other
Scales. (Thanks are due Mrs. Fay B. Krawchick for designing the figure.)


Subjects


The standardization sample of the MSCA, which includes 100 to
106 children at each of 10 age levels between 2½ and 8½ (Total N =
1032), provided the source of data. At each age level there were an
equal number of boys and girls and a proportional representation of
whites and nonwhites in accordance with 1970 Census data. A detailed
description of the sample and the stratification variables appears in the
Manual (McCarthy, 1972). For the present analyses, the sample was
divided into the five groups shown below:


Age Levels       N
2½              102
3-3½            204
4-4½            206
5-5½            206
6½-7½-8½        314


Procedure 


The tests constituting the MSCA were the variables studied. Four of
the tests include 2 or 3 parts and, in general, these parts were treated as
separate variables in the present factor analyses. Whenever a test or
part of a test produced virtually no variation in test scores for a
particular age group (e.g., nearly every 2½-year-old scored 0 on
Draw-A-Child), then that variable was excluded from the analysis for the age
group in question. Word Knowledge was an exception to both of these 
rules. For this test, the total of Parts I and II was entered in each 
matrix since Part I produced little or no variation in test scores at all 
but the youngest age levels. 

The results of the exploratory factor analyses of the standardization 
edition of the MSCA (Kaufman and Hollenbeck, 1973) helped to 
guide the procedure for the present investigation. Since the several 
different factor analytic techniques used for each set of data in the 
previous study gave consistent results, the use of only one technique 
seemed sufficient for the present analysis. In addition, the use of
objective methods for determining the number of factors led to the
extraction of a few apparently trivial factors. Therefore, for the present
analyses, the solutions which made the most psychological sense were 
selected. 

For the analyses, correlation matrices were obtained for the five age
groups (which will be referred to as ages 2½, 3, 4, 5, and 6+). Then,
each matrix was subjected to principal factor analysis, with squared
multiple correlations in the diagonals, followed by varimax rotation of
3-, 4-, 5-, and 6-factor solutions.
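
The extraction and rotation steps just described can be illustrated with a minimal Python sketch. It is not the program used for the MSCA analyses: the function names `principal_factors` and `varimax`, the small hypothetical correlation matrix, and the use of NumPy are assumptions made for illustration, and the rotation shown is raw varimax without Kaiser row normalization.

```python
import numpy as np


def principal_factors(R, n_factors):
    """Unrotated principal-factor loadings from a correlation matrix R.

    Squared multiple correlations (SMCs) replace the unit diagonal as
    communality estimates before the eigen-decomposition.
    """
    R = np.asarray(R, dtype=float)
    smc = 1.0 - 1.0 / np.diag(np.linalg.inv(R))
    reduced = R.copy()
    np.fill_diagonal(reduced, smc)
    eigvals, eigvecs = np.linalg.eigh(reduced)
    keep = np.argsort(eigvals)[::-1][:n_factors]
    # Guard against small negative eigenvalues of the reduced matrix.
    return eigvecs[:, keep] * np.sqrt(np.clip(eigvals[keep], 0.0, None))


def varimax(loadings, max_iter=100, tol=1e-8):
    """Orthogonal varimax rotation using the usual SVD-based iteration."""
    L = np.asarray(loadings, dtype=float)
    p, k = L.shape
    T = np.eye(k)
    previous = 0.0
    for _ in range(max_iter):
        B = L @ T
        G = L.T @ (B**3 - B @ np.diag((B**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(G)
        T = U @ Vt
        current = s.sum()
        if previous != 0.0 and current / previous < 1.0 + tol:
            break
        previous = current
    return L @ T


# Hypothetical pattern: six variables, two uncorrelated factors.
pattern = np.array([[.8, 0], [.7, 0], [.6, 0], [0, .7], [0, .6], [0, .5]])
R = pattern @ pattern.T
np.fill_diagonal(R, 1.0)

unrotated = principal_factors(R, n_factors=2)
rotated = varimax(unrotated)
print(np.round(rotated, 2))
```

Because the rotation is orthogonal, each variable's communality (the row sum of squared loadings) is the same before and after rotation; only the distribution of variance across the factors changes.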


Results 
Factor Structure at the Five Levels

The following solutions seemed to make the most psychological
sense and were, therefore, selected: 4-factor at age 2½, 5-factor
at age 5, and 6-factor at ages 3, 4, and 6+. Loadings of .25 and above
were interpreted as meaningful and only these loadings are included in
the tables.
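
The reporting convention used in the tables can be sketched as follows. The helper name `summarize_loadings` and the small loading matrix are hypothetical; the percentage row is computed here as each factor's share of the summed squared loadings, which is assumed to be what "% of Com. Fact. Var." denotes.

```python
import numpy as np


def summarize_loadings(loadings, cutoff=0.25):
    """Blank out loadings below the cutoff and compute each factor's
    percentage of the common factor variance."""
    loadings = np.asarray(loadings, dtype=float)
    # Loadings whose absolute value is below the cutoff are shown as NaN.
    shown = np.where(np.abs(loadings) >= cutoff, np.round(loadings, 2), np.nan)
    column_ss = (loadings**2).sum(axis=0)
    percent = 100.0 * column_ss / column_ss.sum()
    return shown, percent


# Hypothetical rotated loadings for three tests on two factors.
example = np.array([[0.62, 0.10],
                    [0.30, 0.45],
                    [0.20, 0.58]])
shown, percent = summarize_loadings(example)
print(shown)
print(np.round(percent))
```

Note that the percentages are computed from all loadings, not only from those that survive the .25 cutoff.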

Age 2½. (See Table 1.) As is shown, the four factors were called
Verbal, Motor, General Cognitive, and Memory, respectively. Factor II is a
clear motor factor as each of the three gross motor tests had loadings 
of about .60. The high loadings of Pictorial Memory and Verbal 
Memory on Factor IV suggested that much of the ability involved is 
memory. Factors I and III are both cognitive in nature and each ac- 
counted for 27% of the common factor variance. Factor I was called 
Verbal because the tasks with the highest loadings involve verbal 
ability, and all tasks with loadings of .25 or better (except for Arm 
Coordination), require either verbalization by the child or
comprehension of verbal questions and commands. Factor III was called General
Cognitive because all of the tasks with high loadings require
conceptualization, whether verbal, nonverbal, or quantitative. The only tasks
which did not have meaningful loadings were the ones assessing either 
gross motor coordination or simple rote memory. 

Age 3. (See Table 2.) Factor I was called General Cognitive, as most 
of the cognitive tasks in the battery had substantial loadings. Although 


TABLE 1
Varimax Rotated Factor Matrix of the MSCA Tests at Age 2½
Factor
I II III IV
Test Verbal Motor General Cognitive Memory
Block Building 30 42 45
Puzzle Solving 48 35
Pictorial Memory .62
Word Knowledge I + II 34 25 49 29
Number Questions 41 31 30 25
Tapping Sequence 44 28
Verbal Memory I 38 .54
Leg Coordination 61
Arm Coordination I + II + III 43 59
Imitative Action 58
Draw-A-Design 32 37
Numerical Memory I 57
Verbal Fluency 58 41 38
Counting and Sorting 62
Opposite Analogies .65 38
Conceptual Grouping 40 .57 30
% of Com. Fact. Var. 27 26 21 20


Note.—Only loadings of .25 and above are included.


TABLE 2
Varimax Rotated Factor Matrix of the MSCA Tests at Ages 3-3½
(N = 204)
Factor
I II III IV V VI
Test General Cognitive Motor Verbal Memory Drawing Perceptual-Performance
42 

Block Building 25 26 E 
Puzzle Solving 32 .48 36
Pictorial Memory. 64 30 
Word Knowledge I + II 63 .38 
Number Questions 15 | 
Tapping Sequence 32 33 
Verbal Memory I 26 40 .54 
Verbal Memory II 26 .68
Leg Coordination 37 
Arm Coordination I 61 
Arm Coordination II 65 30 
Arm Coordination III 30 52 
Imitative Action 30 28 26 
Draw-A-Design 25 43 43 
Draw-A-Child 39 ap 
Numerical Memory I .41 27 21 32
Verbal Fluency .63 
Counting and Sorting 37 54 
Opposite Analogies .50 44 .30 
Conceptual Grouping .50 45 29 
% of Com. Fact, Var. 26 16 23 18 9 p 


Note.—Only loadings of .25 and above are included.


a few of the gross motor tasks had loadings of .25 or above, it is clear
that the very high loadings belonged to conceptual tasks such as
Number Questions and Word Knowledge. Factor II was interpreted
as a Motor factor, and Factor III as a Verbal factor. Factor IV was
called Memory, as Verbal Memory I and II had the highest loadings
and all tests of short-term memo: 


Factors V and VI eac 
volving perceptual- 
content and was (ег 
loadings, Factor 
more varied with 
Conceptual Grou;


TABLE 3
Varimax Rotated Factor Matrix of the MSCA Tests at Ages 4-4½
(N = 206)
Factor
I II III IV V VI
Test Drawing Motor Memory Verbal Perceptual-Performance Semantic Memory
Block Building 44
Puzzle Solving 33 26 38
Pictorial Memory 49
Word Knowledge 49 40 27
Number Questions ds .26
Tapping Sequence 28 .39 .50
Verbal Memory I 67 30
Verbal Memory II 40 .26 .38
Leg Coordination 2
Arm Coordination I .50
Arm Coordination II .59
Arm Coordination III 39
Imitative Action 25 .28
Draw-A-Design .51 25 34
Draw-A-Child 56 25 26
Numerical Memory I 26 48 .30
Verbal Fluency 45 .39 .31
Counting and Sorting 31 57
Opposite Analogies 59
Conceptual Grouping 32 399 34
% of Com. Fact. Var. 14 13 22 21 20 10


Note.—Only loadings of .25 or above are included.


no verbal responses by the child. (The .57 loading by Counting and
Sorting on Factor V is not inconsistent with the interpretation because
this number task requires virtually no verbalization.)

The remaining three factors all seem to deal with verbal ability, with
two of them also involving memory. Factor IV was termed Verbal
because of its close similarity with the Verbal factors identified at ages
2½ and 3. Factor III was called Memory since most of the tasks with
high loadings involve short-term memory (Verbal Memory I & II,
Numerical Memory, Tapping Sequence). Finally, Factor VI was
labeled Semantic Memory since the three variables with the highest
loadings are all memory tasks which have a semantic content.

Age 5. (See Table 4.) Factor I was given the name General
Cognitive/Verbal. This factor, which accounts for 37% of the common
factor variance, is certainly a general factor; however, since five of the
six tasks with loadings greater than .50 require verbal ability, the dual
name was assigned.

Factors II, III, and V seem to be clear Motor, Perceptual-


TABLE 4
Varimax Rotated Factor Matrix of the MSCA Tests at Ages 5-5½
(N = 206)
Factor
I II III IV V
Test Gen. Cog./Verbal Motor Perceptual-Performance Memory Quantitative
Block Building 37 
Puzzle Solving 35 50 
Pictorial Memory 42 
Word Knowledge I + II 54 28 46 
Number Questions 40 27 ‘37 
Tapping Sequence 25 1 
Verbal Memory I 64
Verbal Memory II .54
Right-Left Orientation 
Leg Coordination 34 32 
Arm Coordination I 53 
Arm Coordination II 57 
Arm Coordination III 44
Imitative Action 26 6 | 
Draw-A-Design 56 2 
Draw-A-Child 30 .57 5 
Numerical Memory I .56 i 
Numerical Memory II 27 28 .34 s 
Verbal Fluency 51 33 7 
Counting and Sorting .34 43 4 
Opposite Analogies .58 5 
Conceptual Grouping 49 42 2 
% of Com. Fact. Var. 3 4 25 8 d


Note.—Only loadings of .25 or above are included.


Performance, and Quantitative factors, respectively. Factor IV was
labeled Memory, although it is more specific than the memory factors
found at the younger age levels, and seems to involve visual memory.
The large memory factor found at age 5 in the analyses of the
standardization edition of the MSCA was highlighted by the substantial
loadings of Numerical Memory I, Verbal Memory I, and various tests
of verbal ability (Kaufman and Hollenbeck, 1973). That factor seems
to have merged with the General Cognitive factor in the present
analysis.


Age 6+. (See Table 5.) As with the analysis at age 5, Factor I was
given the dual name General Cognitive/Verbal. Factors II, III, IV, and
VI are clearly identifiable as Motor, Perceptual-Performance, Memory,
and Quantitative, respectively. Factor V was called Reasoning because
the tasks with meaningful loadings tend to require the child to
conceptualize at a higher level than most other tasks in the battery.


TABLE 5
Varimax Rotated Factor Matrix of the MSCA Tests at Ages 6½-7½-8½
(N = 314)
Factor
I II III IV V VI
Test Gen. Cog./Verbal Motor Perceptual-Performance Memory Reasoning Quantitative
Puzzle Solving 26 326 45 39
Pictorial Memory 46
Word Knowledge I + II .66 37 28
Number Questions 36 .28 38 32 31 .29
Tapping Sequence .30 .29
Verbal Memory I .56 31
Verbal Memory II 58
Right-Left Orientation
Leg Coordination 27
Arm Coordination I 57
Arm Coordination II .60
Arm Coordination III 49
Draw-A-Design 39
Draw-A-Child 36 49
Numerical Memory I .50
Numerical Memory II 25 .48 29
Verbal Fluency ЕУ 28
Counting and Sorting 26 42
Opposite Analogies 44 .37 27 29
Conceptual Grouping AS
% of Com. Fact. Var. 30 18 23 9 10 10


Note.—Only loadings of .25 and above are included.


Discussion 


Factor analytic studies of the abilities of school-age children have 
appeared frequently in the literature, and although many factors are 
needed to explain the variety and complexity of children's behaviors, 
certain consistencies in the results are apparent. Kaufman and Hol- 
lenbeck (1973) pointed out the similarity of many of the factors 
found in the preliminary factor analyses of the MSCA to those typical- 
ly obtained from analyses of other test batteries such as the Stanford- 
Binet or WISC. This consistency may be further realized by relating 
the MSCA factors obtained in the present analysis at age 6+ to the 
primary mental abilities of school-age children proposed by Thurstone and
Thurstone (1941, 1953). A comparison of the six MSCA factors at age
6+ to the seven Thurstone factors that have been verified most
frequently in research studies (Anastasi, 1968, pp. 329-330) reveals the
following close correspondence between five of the abilities:


MSCA Factor at Age 6+            Corresponding Thurstone Factor
I.   General Cognitive/Verbal    V. Verbal Comprehension
III. Perceptual-Performance      S. Space
IV.  Memory                      M. Associative Memory
V.   Reasoning                   I (or R). Induction (or General Reasoning)
VI.  Quantitative                N. Number


In addition, the sixth McCarthy factor at age 6+—II. Motor—is 
similar to the Motor (Mo) ability identified by the Thurstones and in- 
cluded in their battery to assess coordination of eye and hand move- 
ments (Thurstone and Thurstone, 1953). 

Factor analyses at the preschool age levels, involving groups of sub- 
stantial size, have been far less common than analyses of school-age 
children. Nevertheless, ample evidence has accumulated to show that 
at the ages of 2 or 3, and even at various levels of infancy, there are 
specific intellectual abilities that emerge as group factors—sometimes 
in addition to a general factor (McNemar, 1942; Quereshi, 1967; 
Richards and Nelson, 1939), and sometimes in the absence of g (Hurst, 
1960; Meyers, Dingman, Orpet, Sitkei, and Watts, 1964; Stott and 
Ball, 1963). 

An area of extreme interest to many investigators has been the 
relationship of the abilities at the preschool levels to those at school 
age levels and, in particular, the developmental progression of these 
abilities. A number of studies have explored this topic by conducting 
separate factor analyses of preschool and school age groups and
comparing the resulting structures. However, these studies have not
provided definite answers to the developmental question for a variety
of reasons such as the following: the nature of the tasks was markedly
different from age level to age level, even when the same instrument 
was used for all children (Stott and Ball, 1963); the age range of the 
different samples was not sufficiently broad, despite spanning 
preschool and school-age children (Hollenbeck and Kaufman, 1973;
e and Lindsey, 1967); the tasks assessed only a limited range of 
children’s abilities due to the particular instrument studied (Quereshi,
1967); and the specificity of the investigators' hypotheses (Meyers et
al., 1964).

The present analyses are thus of great importance to developmental
theory. First, the age span of 2½ to 8½ years is quite broad, covering a
period of rapid and important behavioral change; moreover, the
homogenizing influence of schooling is relatively minimal for children
in this age range. Also, the nature of the test content is very similar
throughout the age range. Although the wide age span necessitates
having a few tasks that are useful only for very young or older children
within the range examined, most of the tests in the MSCA produced 
substantial variation in scores at virtually all of the standardization 
age levels (McCarthy, 1972, pp. 204-205). In addition, the tasks in the 
MSCA provide broad coverage of the cognitive abilities that have 
been shown to be important in many previous studies of children's in- 
telligence, including Kelley's (1928) early factor analytic investigation. 
Finally, whereas many factor analytic studies in the literature utilized 
groups which were limited in size and were often from a particular city 
or socioeconomic class, the MSCA sample is representative of the U.S. 
population on many important variables such as color, region, and 
father's occupation. 


Developmental Interpretation of the Results 


Certain developmental trends were quite evident in the present 
analyses. The most important was the emergence of one additional 
ability at each of three levels: age 3, age 5, and age 6+. First consider 
that the four abilities found at the youngest age level— Verbal, Motor, 
General Cognitive, and Memory—were also identified at every
subsequent age level, with minor exceptions. (The exceptions: no General
Cognitive factor at age 4; a blending of General Cognitive and Verbal
factors at ages 5 and 6+.) Then, at age 3, a Perceptual-Performance
factor appeared—and this factor too was isolated at each older age 
level. A Quantitative factor appeared for the first time at age 5 and 
recurred at age 6+, which agrees with Thurstone and Thurstone (1948) 
who found that the ability to work quickly and accurately with 
numbers emerges gradually from a more global quantitative ability 
that is not distinct from other factors (such as verbal) in the young 
child. Finally, a conceptual factor called Reasoning emerged at age 
6+, probably reflecting the fact that many of the more difficult items in 
the MSCA are more abstract than the easier items intended primarily 
for preschool and kindergarten age children. 

It is of interest to relate the present results to Garrett's (1946) 
hypothesis—i.e., that there is a large general intellectual ability in in- 
fancy and early childhood which diminishes in size as the child gets 
older due to a gradual differentiation of specific abilities. To consider 
Garrett's hypothesis, one must ignore the nonintellectual psy- 
chomotor tests, as Guilford (1967, p. 414) points out. In the present 
series of analyses, the gross motor variables tended to load only on the
Motor factor, so this factor will not be considered in the following
discussion.

The presence of the Verbal and Memory group factors as early as
age 2½—with the Verbal factor accounting for the same percentage of
variance as the relatively small General Cognitive factor—provides
evidence contrary to Garrett's hypothesis. Although the emergence of
a new group factor at ages 3, 5, and 6+ seems supportive of Garrett's
theory, it is apparent that these group factors did not result from suc- 
cessive differentiations of a large general ability. In addition, there 
were two main developmental trends working in the opposite direction 
from the Garrett hypothesis; namely, the blending of the Verbal group 
factor with g at the two highest age levels, which suggests that once a 
child reaches school age his general mental ability may be closely in- 
tertwined with his verbal facility; and second, the appearance of a 
Drawing factor at ages 3 and 4 but not in subsequent analyses. The net 
result of the important developmental trends is that there were ap- 
proximately the same number of group factors at ages 3 through 6+, 
and that the percentage of variance accounted for by the several 
General Cognitive factors did not decrease with age. Overall, then, 
Garrett's hypothesis was not supported. 

The clear Drawing factor that emerged at ages 3 and 4 is worthy of 
comment, because it may reflect the changing nature of the abilities
required for successful performance on the drawing tests at different
ages (rather than indicating an ability that “disappears” at age 5). One 
may hypothesize that design copying and drawing a child are 
predominantly motor tasks for younger children, and gradually 
become more conceptual once pencil-and-paper coordination is 
mastered. If the Drawing factors at ages 3 and 4 are perceived as
involving fine motor coordination (rather than a specific cognitive
ability), the results of the present analyses suggest that the drawing 
tests are primarily motor at age 3, partially motor and partially 
cognitive at age 4, and predominantly cognitive at ages 5 and above. 


Construct Validity of the MSCA 


There are three main questions to ask regarding the factorial
evidence of the MSCA's construct validity: (1) Are there factors which
correspond to the abilities assessed by each of the six scales at some or
all of the age levels studied? (2) Does the factor structure across age
levels suggest that additional or alternative scales might have been 
selected for the MSCA? and (3) Do the factor loadings support the 
placement of each component test on its particular scale(s)? 

The first question has already been answered in the affirmative, as a
factor corresponding to each scale was identified in two or more of the
analyses. A Motor factor was found in all five analyses, as were both
Verbal and Memory factors. Of these three, Motor showed the least
age-to-age variation, which was also true in Richards and Nelson's
(1939) analyses at ages 6, 12, and 18 months. The Verbal factor,
however, was merged with General Cognitive at ages 5 and 6+, and
the Memory factor accounted for less than 10% of the variance at these
same two age levels. General Cognitive and Perceptual-Performance
factors were extracted in four of the five analyses, and only the
Quantitative Scale was represented in as few as two analyses.

In response to question #2, there does not seem to be evidence sup- 
porting the need for additional or alternative scales. Factors labeled 
Semantic Memory and Reasoning each appeared at only one age level. 
As mentioned earlier, the emergence of the Drawing factor at ages 3 
and 4 suggests that Draw-A-Design and Draw-A-Child have a motor 
component for children in the younger half of the MSCA age range. 
(The meaningful loadings of Draw-A-Design on the Motor factors at 
ages 2½ and 3 also support this interpretation.) McCarthy’s placement
of the two drawing tests on both the Perceptual-Performance and 
Motor Scales reflects the dual nature of the abilities required for these 
tests. This scale structure seems as good as any alternative solutions 
that might be suggested by the present data—particularly when one 
considers the practical advantages of having the same set of scales for 
all children within the MSCA age range. 

To answer question #3 regarding the appropriateness of the scale 
placement of the component tests, the pattern of factor loadings for 
each test was examined across the age range. The following points 
become apparent: 


1. The Motor factor tended to be defined by the Leg Coordination, 
Arm Coordination, and Imitative Action tests, all of which are 
included on the Motor Scale. 

2. Of the four tests on the Memory Scale, Pictorial Memory, Tapping
Sequence, Part I of Verbal Memory, and Part I of Numerical
Memory each had meaningful loadings on at least three of the
five Memory factors. Of tests not on the Memory Scale, only Puzzle
Solving and Word Knowledge—both of which require some
recall of past experience—had as many as three meaningful
loadings.

3. Of the three tests on the Quantitative Scale, Number Questions, 
Counting and Sorting, and Part II of Numerical Memory each 
had meaningful loadings on both of the Quantitative factors; no 
other variable had a loading of .25 or better on both of these fac- 
tors. 

4. Of the six tests on the Perceptual-Performance Scale (excluding
Right-Left Orientation, which is not administered to children
below 5), all except for Tapping Sequence had meaningful
loadings on at least three of the four Perceptual-Performance
factors. Of the tests not on the Perceptual-Performance Scale, only
Counting and Sorting (which is largely nonverbal) and Verbal
Fluency had three meaningful loadings.

5. Of the five tests on the Verbal Scale, Word Knowledge, Part I of
Verbal Memory, Verbal Fluency, and Opposite Analogies had
meaningful loadings on either four or all five of the Verbal
factors. Among other tests, only Number Questions, Part I of
Numerical Memory, and Conceptual Grouping had four
meaningful loadings, and these bear a logical relationship to
verbal ability.

6. Of the 15 tests which make up the General Cognitive Scale, 9 had
meaningful loadings on three or four of the General Cognitive
factors, and two others had meaningful loadings on two of the
factors. Of the three gross motor tests not on the General
Cognitive Scale, only Leg Coordination had as many as two
loadings of .25 or greater.

Overall, then, the factor structure gave support to McCarthy's
placement of the tests in the various scales, although in some instances
the data might be interpreted as supporting alternative test
placements. For example, Number Questions (which involves comprehension
of oral questioning as well as number facility) might seem to merit
placement on the Verbal as well as on the Quantitative Scale based on
its loadings. As a second example, there is no empirical support for the
placement of Part II of Numerical Memory on the Memory Scale.
However, these and other similar discrepancies seem to be minor in the
face of the high degree of consistency between the factor loadings and
the scale structure at the various age levels. In addition, there are
certainly logical and practical considerations which would support
McCarthy’s test placement (e.g., Part II of Numerical Memory is logically
a memory test, and it would not have been practical to fragment the
Numerical Memory test by placing only Part I on the Memory Scale).
Some inconsistencies are due to the inherent complexities of the
mental processes of even simple tasks. The theoretical purity of factor
structure that one strives for simply may not be attainable with a
practical tool such as the MSCA.

The present analyses are basically consistent with the preliminary
analyses of the standardization edition which were available when the
scales were selected; nevertheless, there are also some important
differences. For example, in the preliminary analyses at ages 3-3½
and 7½-8½, Leg Coordination did not have substantial
loadings on the Motor factors; in addition, although Quantitative and
Perceptual-Performance factors were each tentatively identified in the
previous analyses, they were isolated at different age levels (rather than
at a single level, as was found in the present analyses at ages 5 and 6+).
Thus, the factor analyses described herein not only give evidence for 
the construct validity of the MSCA; they are even more consistent 
with the final scale structure chosen by McCarthy than are the 
preliminary factor analyses.


POSSIBLE SAMPLING BIAS IN GENETIC
STUDIES OF GENIUS


DANIEL P. KEATING¹
University of Minnesota


The data from Terman's Genetic Studies of Genius (1925-1959) 
relating to sample size, mean IQ, and variance of IQ scores were 
analyzed in terms of their conformation to the theoretically projected 
statistics derived from a consideration of the normal curve. Devia- 
tions from the theoretical projections lead to the probable conclusion 
that the sample size was too small, with the IQ scores clustered more 
closely about a significantly higher mean than projected. Although 
the major findings of the “Genius” study are not cast into doubt by 
this analysis, caution is urged with respect to comparisons to a nor- 
mal sample when the differences are not large. 


THE five-volume Genetic Studies of Genius (1925-1959), edited by 
the late Lewis Terman, has been widely and justifiably acclaimed as a 
landmark in longitudinal research. Its refutation of myths widely held 
at the time (e.g., that highly intelligent children are weak and sickly, 
that early ability is rarely maintained through adolescence and into 
maturity) was a starting point for work with gifted children, as well as 
for much research into the intellectual development of individuals. 
The study also illustrated the many difficult and often intractable
problems of large-scale longitudinal research, one of which is
examined here more closely.

In Terman's (1925) selection of his gifted group, he realized that his
sample was not entirely correct. He stated:


¹ The author would like to thank Julian C. Stanley for his assistance in the prepara-


One may conclude that the method of selection employed,
although far from ideal, probably led to the discovery of at least 80
percent and possibly 90 percent of all the cases who could have
qualified in the school population canvassed (p. 33).


What he may not have realized was that his estimate of error might 
itself have been erroneous. There are a number of indications that it 
was. 

This is most clearly seen from an examination of the normal curve.
With a mean of 100 and a σ = 16 (Terman and Merrill, 1937) the
percentage of cases falling beyond +2.5σ (i.e., 140 IQ) = .62%.
Multiplying the population canvassed, 168,000 (Terman, 1925, p. 29),
by this figure yields a projected sample of about 1042 cases over 140
IQ. Terman's (1925) actual yield was 649 cases (.38% of the
population), or 61.22% of the projected sample. Further, the projected mean
of the portion of the unit normal distribution beyond +2.5σ (140 IQ)
is, after Kelley (1947, p. 297),


$$\mu_{>2.5\sigma} = \frac{y_{2.5}}{1 - P_{<2.5}} = \frac{0.0175283}{0.0062097} = 2.8227\sigma,$$


where $y_{2.5}$ is the height of the ordinate 2.5σ above the mean of the unit
normal distribution, and $P_{<2.5}$ is the area of the distribution below 2.5σ.

The mean of this tail portion of the unit normal distribution is 2.82;
thus the mean of scores beyond 140 IQ is


100 + (2.8227)(16) = 145,


where 100 is the overall IQ mean and 16 is the standard deviation of
IQ scores. The actual mean of the gifted group was 151 (Terman, 1925,
p. 45), a difference of 6; this is about .4σ above the mean of the normal
distribution beyond +2.5σ.
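
The projection above can be checked numerically. The following is a minimal sketch using only the Python standard library; the function and variable names are illustrative and are not taken from Kelley or Terman. With the unrounded tail area the projected count comes to about 1043, whereas the figure of about 1042 follows from the rounded .62%.

```python
import math


def upper_tail_area(z):
    """Area of the unit normal distribution above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def ordinate(z):
    """Height of the unit normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


IQ_MEAN, IQ_SD = 100.0, 16.0
CANVASSED = 168_000                        # population canvassed (Terman, 1925)
cutoff_z = (140.0 - IQ_MEAN) / IQ_SD       # 2.5 standard deviations

tail_area = upper_tail_area(cutoff_z)      # proportion of cases above 140 IQ
projected_n = CANVASSED * tail_area        # projected number of cases
tail_mean_z = ordinate(cutoff_z) / tail_area   # mean of the tail, in sigma units
tail_mean_iq = IQ_MEAN + IQ_SD * tail_mean_z   # mean of the tail, in IQ points

print(f"tail area = {tail_area:.7f}")
print(f"projected N = {projected_n:.0f}")
print(f"tail mean = {tail_mean_z:.4f} sigma = {tail_mean_iq:.1f} IQ")
```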

Jensen (1969) has pointed out the variations in the normal curve for 
IQ at the extremes. This casts some doubt on the reliability of the 
difference between the projected mean and the actual mean. However, 
it reinforces the difference between the projected and actual size of the 
sample. The proposed alteration of the normal curve would, if
anything, increase the percentage of area under the curve beyond
+2.5σ, thus increasing the size of the projected sample.

The theoretical standard deviation for the normal distribution of
scores beyond +2.5σ may also be calculated. The variance is, again
after Kelley (1947),


$$\sigma^2_{>2.5\sigma} = 1 + \frac{2.5\,y_{2.5}}{1 - P_{<2.5}} - \mu^2_{>2.5\sigma} = 1 + \frac{2.5(0.0175283)}{0.0062097} - (2.8227)^2 = 0.089187,$$


where $\sigma^2_{>2.5\sigma}$ is the variance of the unit normal distribution beyond
+2.5σ (values derived from Pearson and Hartley, 1956). For σ = 16, σ²
= 256, and $\sigma^2_{>2.5\sigma}$ = 0.089187, the standard deviation for the portion
beyond 140 IQ is


$$\sqrt{(0.089187)(256)} = \sqrt{22.83187} = 4.8.$$


By comparison, the standard deviation of the obtained sample was
10.2 (Terman, 1925, p. 45). This suggests the picture of the
theoretically projected sample having scores clustered much more closely around
a significantly lower mean (p < .001).
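
The standard deviation of the tail can be checked in the same way. This sketch assumes the standard result for a normal distribution truncated below at z, namely variance = 1 + zm − m², where m is the tail mean in σ units; the names are illustrative. Direct evaluation may differ from 0.089187 in the later decimals, presumably because of rounding in the tabled values, but it gives the same standard deviation to one decimal place.

```python
import math


def upper_tail_area(z):
    """Area of the unit normal distribution above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def ordinate(z):
    """Height of the unit normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def tail_variance(z):
    """Variance (in unit-normal terms) of the portion of the curve above z."""
    tail_mean = ordinate(z) / upper_tail_area(z)
    return 1.0 + z * tail_mean - tail_mean**2


IQ_SD = 16.0
variance_z = tail_variance(2.5)
tail_sd_iq = IQ_SD * math.sqrt(variance_z)   # SD of scores above 140 IQ

print(f"variance = {variance_z:.5f}, SD = {tail_sd_iq:.2f} IQ points")
```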

The schematic comparison of the distribution of obtained IQ's 
above 140 in Terman's (1925) sample with the tail portion of the unit 
normal distribution in Figure 1 demonstrates the nature of the
discrepancies calculated above. The shaded area represents the
discrepancy leading to the inflated mean of the actual sample. This
indicates, as can be seen from Figure 1, that too few “low” subjects and
too many “high” subjects were included in the sample.

The calculations of the sample were performed on the grouped data 
found in Terman (1925). There is the possibility that the calculations 
might be affected by the grouping procedure. Unfortunately the
original ungrouped data are not directly recoverable (Oden, personal
communication), but the appropriate calculations were performed on
those ungrouped data which were easily recoverable. No important
differences were found between the grouped and ungrouped statistics. 

It might be argued that the above statistical arguments fail because
of the nature of the 1916 revision of the Binet-Simon (Terman, 1916).
The mean IQ and σ were not calculated at that time, and there was no
specification of the IQ distribution as a normal curve. One might infer
from the technical monograph which accompanied the 1916 Stanford
revision (Terman et al., 1917, p. 43) that σ = 13.5. Using the same
reasoning as above, we obtain the following statistics for the
theoretically projected sample: N = 267; mean = 143; standard
deviation = 3.6. Thus the mean and standard deviation are even more
discrepant from the obtained figures than with an assumption of σ = 16,
but the obtained sample size is greater than the projected size rather
than smaller.
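
The alternative projection for σ = 13.5 can be sketched in the same manner. The cutoff of 2.95σ used below is an assumption: 40/13.5 is closer to 2.96, but a cutoff of 2.95 yields a projected N of about 267, so it appears to be the value underlying the figures quoted above. The helper name `tail_projection` is hypothetical.

```python
import math


def upper_tail_area(z):
    """Area of the unit normal distribution above z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))


def ordinate(z):
    """Height of the unit normal curve at z."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def tail_projection(cutoff_z, iq_sd, canvassed=168_000, iq_mean=100.0):
    """Projected N, mean IQ, and SD of IQ for the portion above cutoff_z."""
    area = upper_tail_area(cutoff_z)
    mean_z = ordinate(cutoff_z) / area
    var_z = 1.0 + cutoff_z * mean_z - mean_z**2
    return canvassed * area, iq_mean + iq_sd * mean_z, iq_sd * math.sqrt(var_z)


# Assumed cutoff of 2.95 sigma for sigma = 13.5 (see the lead-in above).
n_135, mean_135, sd_135 = tail_projection(cutoff_z=2.95, iq_sd=13.5)
print(f"N = {n_135:.0f}, mean = {mean_135:.1f}, SD = {sd_135:.1f}")
```

The same function with `cutoff_z=2.5` and `iq_sd=16.0` returns the projection for the 1937 scale discussed earlier.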


[Figure 1. Schematic comparison of actual sample and theoretically projected
sample. Legend: tail of unit normal distribution; dashed line, Terman (1925)
data. Horizontal axis: IQ scores (abscissa of unit normal distribution, from
+2.5σ). Vertical axis: frequency (ordinate of unit normal distribution).]

There is also considerable deviation of the gifted group sample
(Terman, 1925) from the sample projected by using Terman’s (1916)
percentages. At least 0.55% of the standardization group score
above 135 on the 1916 scale. Even allowing .05% for the 136-139 
range, the projected sample size is 840. The actual sample (639) is thus 
76.1% of the projected sample. The projected error, therefore, ranges
from a low of nearly 24% to a high of almost 40%. 

There are a number of plausible speculations regarding the source of 
this sampling error. First, the population from which the sample was 
drawn might have been markedly non-normal. Given the high number 
of students canvassed and the demonstrated normality of the
Stanford-Binet scores (Terman and Merrill, 1937), however, this seems
unlikely. 

The second speculation seems more likely. Taking together the facts 
of a too small sample, a (possibly) too high mean, and a (possibly) too 
large standard deviation, the intuitive inference is that too few of the 
"borderline" cases (140-150) were included. Terman (1925), after con- 
sidering the defects of his selection techniques, conjectured about the 
nature of the cases he failed to locate: 


They would almost certainly have been found a little less ac- 
celerated in school. Some would be excessively shy, others lazy, 
and still others lacking in adaptability [p. 33]. 


A third possible source of sampling error was the concentration on 
urban and suburban canvassing, with the rural population being 
nearly ignored (Terman, 1925, p. 29). The concentration of talent (as 
measured by IQ) tends to be greater in metropolitan as opposed to 
rural areas (e.g., Terman and Merrill, 1937; von Fieandt, 1958). The 
overload of “high” cases may be partly attributed to this factor. 

If the error is actually closer to the 25-40% we have suggested than 
to the 10-20% Terman (1925) estimated, and if his characterization of 
those not included is correct, then one may easily see the ramifications 
for the significance of a number of his conclusions. Many of the
statistically significant differences between his gifted group and the
“general population” which were reported in 1925 and subsequently 
throughout the longitudinal study (Burks, Jensen, and Terman, 1930; 
Terman and Oden, 1947; Terman and Oden, 1959; and Oden, 1968) 
may in fact lack significance for the specific sample which Terman 
originally prescribed, especially since the “missed” cases were likely to 
be less differentiated from the general population than the cases in the 
obtained sample. It is clear from both the number and the degree of 
differences obtained that the major conclusions of the “Genius Study”
were warranted. Caution is urged, however, in the interpretation of
data from the “Genius Study” where only a small difference was 
reported.


UNIVOCAL VARIMAX: AN ORTHOGONAL FACTOR
ROTATION PROGRAM FOR OPTIMAL
SIMPLE STRUCTURE


DOUGLAS N. JACKSON AND HARVEY A. SKINNER
The University of Western Ontario 


Univocal varimax is an orthogonal factor rotation strategy aimed 
at improving upon the simple structure qualities of a preliminary 
varimax solution. This is accomplished by targeting for patterned
rotation the highest element in each row of the varimax factor 
loading matrix. This tends to yield a solution in which each variable 
in the final rotated matrix maximally loads on only one factor. 
Univocal varimax is particularly relevant to research problems in 
which each rotated factor should be marked by a relatively tight 
cluster of variables. A FORTRAN IV program is described for the 
efficient analysis of large input factor loading matrices. 


THE purpose of this paper is to describe the essential characteristics 
of an orthogonal factor rotation program UNIVMX for achieving op- 
timal simple structure. 

The rationale underlying this procedure, termed univocal varimax,
is an attempt to improve upon the simple structure qualities of a
preliminary varimax solution so that each variable maximally loads on
only one factor. That is, program UNIVMX seeks to orient each factor
through a relatively tight cluster of variables. Factors in the final
solution should be characterized by several high loadings, with the
remaining loadings near zero. It is thus designed to avoid the situation,
sometimes encountered in using standard varimax, in which a single
variable has moderate loadings on each of several factors. In
many respects, program UNIVMX may be considered an orthogonal
analogue to the Promax oblique rotation strategy of Hendrickson and
White (1964).

Potential research applications include the Tucker and Messick 
(1963) points-of-view model of multidimensional scaling, where it is 
important that each point-of-view factor be marked by a definable 
subgroup of judges in the scaling experiment (Cliff, 1968). Another ex- 
ample is the use of Q-technique factor analysis for classification 
research (Skinner, Jackson, and Hoffmann, 1974). Each entity factor 
should be aligned with a relatively homogeneous cluster of individuals 
to facilitate interpretation of the solution. Indeed, univocal varimax 
may be used in any application in which a given test is expected to ex- 
hibit loadings on one and only one factor, e.g., multitrait-multimethod 
matrices involving correlations among measures of intellectual 
abilities or of personality constructs. 


Computational Strategy 


Program UNIVMX first rotates an input factor loading matrix to a
varimax criterion (Kaiser, 1958). Then, the varimax solution is scanned
to identify the factor on which each variable has its highest loading. An
hypothesis matrix composed of 1's, -1's, and 0's is generated on the basis
of two criteria: (a) that a variable have a loading on a particular factor
above a user-defined minimum (or a default value of |.50|); and (b) that
this loading be the highest loading for that variable. Finally, an orthogonal
procrustean transformation is performed whereby the input factor
loading matrix is rotated to a least-squares fit to the hypothesis matrix
(Schönemann, 1966).
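
The three steps just described can be sketched in a few lines of Python with NumPy. This is not the UNIVMX source, only a minimal illustration: the function names are hypothetical, the varimax step uses a common SVD-based iteration without Kaiser row normalization (an assumption), and the procrustean step uses the standard SVD solution to the orthogonal least-squares problem.

```python
import numpy as np

def varimax(A, max_iter=200, tol=1e-10):
    """Raw varimax rotation (no row normalization; a simplifying assumption)."""
    p, k = A.shape
    T = np.eye(k)
    crit = 0.0
    for _ in range(max_iter):
        L = A @ T
        grad = A.T @ (L**3 - L @ np.diag((L**2).sum(axis=0)) / p)
        U, s, Vt = np.linalg.svd(grad)
        T = U @ Vt
        if s.sum() < crit * (1.0 + tol):   # criterion stopped improving
            break
        crit = s.sum()
    return A @ T, T

def univocal_varimax(A, minimum=0.50):
    """Varimax, then hypothesis matrix, then orthogonal Procrustes fit."""
    A = np.asarray(A, dtype=float)
    V, _ = varimax(A)
    # Hypothesis matrix: +1/-1 at each variable's highest loading if it
    # exceeds the minimum in absolute value, 0 everywhere else.
    H = np.zeros_like(V)
    rows = np.arange(V.shape[0])
    top = np.abs(V).argmax(axis=1)
    keep = np.abs(V[rows, top]) > minimum
    H[rows[keep], top[keep]] = np.sign(V[rows[keep], top[keep]])
    # Orthogonal Procrustes: T minimizes ||A T - H|| subject to T'T = I.
    U, _, Vt = np.linalg.svd(A.T @ H)
    T = U @ Vt
    return A @ T, H, T
```

Because every step is an orthogonal rotation, the communalities (row sums of squared loadings) of the final solution equal those of the input matrix, which is a convenient check on any implementation.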


Input 


The data deck for each problem consists of (a) a title card, (b) a
parameter card specifying the number of variables and factors com- 
prising the input factor loading matrix, (c) a format card for reading 
the data, and (d) the input factor loading matrix. 


Output 


Program output includes the (a) input factor loading matrix
(optional), (b) preliminary varimax solution, (c) hypothesis matrix, (d)
transformation matrix, and (e) final univocal varimax solution.


Capabilities and Availability 


The program is written in FORTRAN IV either for the CDC
CYBER 73 or for IBM systems. The maximum input factor loading
matrix is of order 200 variables by 30 factors. This limitation may be
easily modified by the user to handle larger or smaller matrices,
depending upon system capacity. Instructions for use, and a source listing,
which includes a sample problem, are available from Douglas N.
Jackson, Department of Psychology, The University of Western
Ontario, London, Ontario N6A 5C2, Canada. Requests for the program
should specify which version, the CDC CYBER 73 or the IBM, is
desired.


THE PATH ANALYSIS OF COMPLEX
RECURSIVE SYSTEMS


CHARLES F. TURNER
London School of Economics and Political Science
University of London


Methods for the path analysis of complex (i.e., not fully recursive)
causal models are briefly discussed. A computer program which
simplifies analysis of such models and provides an option for
automatically deleting marginal paths is described.


PATH analysis was introduced by Wright (1921; 1934) to provide
geneticists with a methodology for systematically describing the
functional relations between variables based upon their intercorrelations
and theoretically derived assumptions of causal asymmetry. In
Wright’s own words, path analysis “is an attempt to present a method
of measuring the direct influence along each separate path in such a
[linear, additive, causal] system, and thus of finding the degree to
which variation of a given effect is determined by each particular cause
[1921, p. 557].” Recently it has been demonstrated that path analysis
has a wide range of interesting applications in the causal modelling of
social and psychological processes (e.g., Duncan, 1966; Hauser, 1972).
Land (1969) has provided the interested reader with a recent and
readable formulation of the basic theory and limitations of the
method.

A computer program (PATHL) has been designed to simplify the
path analysis of complex recursive systems in which variables that occur
contemporaneously in the real world must be introduced
simultaneously into a model. In the analysis of such models PATHL
requires the assumption of noncausality within blocks of contemporaneous
variables, as well as unidirectionality of effect between
stages of a given model, and the other normal assumptions of path
analysis concerning adequacy of specification, correlations among
residuals, and homoscedasticity. Of course, the first assumption need
not be made when PATHL is used in the analysis of a simple recursive
system.


Method and Output 


PATHL initially computes the means, standard deviations, and a
triangular matrix of product moment correlations for all variables in
the database specified by the user. Subsets of this information are then
selected at each stage of the modelling process. The resulting
submatrices of intercorrelations are inverted by the Gauss-Jordan method
to provide least squares estimates of the regression coefficients as well
as standardized path coefficients. The path coefficient for the residual
term is derived, and confidence statistics for the coefficients (t-values
and standard errors) as well as an analysis of variance for the
modelling stage are computed.
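
For a single modelling stage, the core computation can be sketched as follows. This is an illustration in Python with NumPy rather than PATHL's own routines: the function name and arguments are hypothetical, and a linear solve stands in for the explicit Gauss-Jordan inversion. The standardized path coefficients are obtained from the submatrix of predictor intercorrelations, and the residual path is the square root of the unexplained variance.

```python
import numpy as np

def path_coefficients(R, predictors, outcome):
    """Standardized path coefficients for one stage of a recursive model.

    R          : full correlation matrix (2-D array)
    predictors : list of column indices treated as causes at this stage
    outcome    : column index of the dependent variable
    """
    R = np.asarray(R, dtype=float)
    Rxx = R[np.ix_(predictors, predictors)]   # predictor intercorrelations
    rxy = R[predictors, outcome]              # predictor-outcome correlations
    p = np.linalg.solve(Rxx, rxy)             # same result as inv(Rxx) @ rxy
    r_squared = float(p @ rxy)                # variance explained
    residual_path = np.sqrt(1.0 - r_squared)  # path from the residual term
    return p, residual_path

# Hypothetical three-variable example: variables 0 and 1 cause variable 2.
R = [[1.0, 0.5, 0.6],
     [0.5, 1.0, 0.7],
     [0.6, 0.7, 1.0]]
paths, residual = path_coefficients(R, [0, 1], 2)
print(paths, residual)
```

Repeating the call with a different outcome and predictor set for each stage traces out the full recursive model.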

Following Duncan's suggestion (1966, p. 7), PATHL provides users
with the option of automatically producing a reduced model in which
paths of marginal absolute value or statistical significance are
eliminated.


Machine Requirements 


PATHL is composed of routines written in FORTRAN IV and 
(IBM) OS/360 Assembler language. The execution of PATHL re- 
quires 90K bytes of core storage and two reusable i/o mediums. 

PATHL, which has been successfully used on a variety of IBM/360/370
machines, should be convertible to other large-scale computing
systems.


Execution Control and Data Management 


Any number of new databases may be constructed and analysed by
PATHL. The user provides one statement of database parameters and
one format statement to describe each new set of input data. Facilities
are provided for data screening.

Models are built from endogenous and exogenous variables which
the user specifies in a simple manner at each stage of model
construction. No more than 40 variables may be used in any single model.
There are no restrictions on the number of models which may be
constructed from a given database subset.

When executing the PATHL load module on an IBM/360-91, each
modelling stage requires an average of less than one-tenth of a second 
of cpu time. Database manipulations are only required when new sub- 
sets are requested by the user; these manipulations increase execution 
time to an extent proportional to the frequency of their occurrence, the 
size of the database, and the efficiency of the storage format of the in- 
put data. 


Availability 


The following materials may be obtained from the author: (a) 
source statements and optimized object modules stored in a con- 
venient format on 9-track magnetic tape, (b) printed listing of the 
source code and exemplary output, (c) a detailed guide to the use of 
the program, and (d) technical notes for programmers wishing to
implement PATHL on non-IBM machines.


IRIS: A COMPUTER-INTERACTIVE APL PROGRAM
FOR RECOVERING SIMPLE ORDERS


THOMAS J. REYNOLDS AND NORMAN CLIFF 
University of Southern California 


IRIS, a computer-interactive APL program for developing a sim- 
ple order for a set of stimuli, is described. IRIS executes interactively 
by presenting pairs to be judged at a remote terminal. The subject 
responds by indicating his preference, or, by indicating that he has 
no preference. From a subject's judgments, IRIS determines which 
judgments are implied via transitivity and presents only pairs for 
which no implications are known. A substantial reduction in number 
of paired-comparisons required to recover a simple order results. 


THE purpose of this paper is to describe the rationale underlying as
well as the major features of a computer-interactive APL program
IRIS, the objective of which is to obtain an individual’s simple ordering
of a set of stimuli in as few paired comparison preference judgments
as possible. In a general context, IRIS attempts to attack the
number-of-judgment problem associated with theoretically sound
comparative judgment models, namely, the requirement of N(N - 1)/2
pairwise presentations constituting all possible pairs. This
problem is considerable, in that, for even moderately large stimulus
sets, the number of judgments required increases parabolically,
reaching 300 and 780 judgments for N = 25 and 40, far exceeding any
reasonable expectation in terms of both a subject's time and attention
span. IRIS attempts to resolve this problem by determining which
pairwise presentations would be redundant, by implication, and
thereby to avoid a duplication of information already known.
INTERORD, a FORTRAN counterpart to IRIS (using an identical
rationale), has been developed and is currently available (Kehoe and
Cliff, 1975).

A simple order may be completely determined by obtaining an
observed preference judgment for each adjacent pair of stimuli (in the
order). This principle of connectiveness in determining an order is
founded in the graph theoretical assumptions of directed graphs
(Harary, Norman and Cartwright, 1965). Thus, the complete
determination of a simple order can be accomplished once these
appropriate N - 1 judgments connecting the adjacent stimuli in the
order are known. The remaining (N - 1)(N - 2)/2 judgments become
redundant, as they are implied via transitivity.

It has been proposed that log2 N! is the minimum number of judgments
required to obtain the N - 1 chain, which yields the complete
order (Harary, Norman and Cartwright, 1965; Knuth, 1973). Since
IRIS requires, for both errorless and errorful data, a number very
close to this proposed minimum (Cliff and Reynolds, 1974; Reynolds,
1975), the number of pairs that need be presented to a subject when a
simple ordering is desired is greatly reduced. For example, for N = 25,
the minimum number is 84, a substantial saving in comparison to the
300 (all possible pairs). More significant, however, is the reduction in
the number of required judgments for larger stimulus sets. In the N =
40 example, where 780 represents the number of all possible pairs, the
minimum number is 160, which remains well within a reasonable
number of judgments to require of a subject.
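
The figures quoted above can be checked with a short calculation. The sketch below (function names are illustrative) computes the number of all possible pairs, N(N - 1)/2, and the proposed minimum, log2 N! rounded up to a whole judgment.

```python
import math

def all_pairs(n):
    """Number of judgments in a complete paired-comparison design."""
    return n * (n - 1) // 2

def minimum_judgments(n):
    """Proposed minimum, log2(n!), rounded up to a whole judgment."""
    return math.ceil(math.log2(math.factorial(n)))

for n in (25, 40):
    print(n, all_pairs(n), minimum_judgments(n))
# prints: 25 300 84
#         40 780 160
```

The ratio of the two counts falls as N grows, which is why the saving is proportionally larger for the bigger stimulus sets.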

The process of presenting only those pairs which are not redundant 
requires an interactive computer system utilizing a remote terminal 
from which the subject responds to each pair-wise presentation. Thus, 
the IRIS algorithm may be summarized, simply, as a process whereby 
a search for implications is made following each response, and, once 
having been determined, these implied judgments are recorded as if the 
subject made them directly. Thereby redundancy is avoided. 

The search for implications is based upon a Boolean matrix
multiplication procedure originally proposed by O'Neil and O’Neil (1973)
and implemented by Cliff (1975, in press). The worth of this procedure
stems from the fact that all elements of the response need not be
subjected to matrix multiplication; rather, only those elements involved in
the last paired-comparison need be considered in calculating the new
implications. The saving in the total number of required calculations
brought about by the utilization of this matrix multiplication short-cut
not only accounts for the absence of any lag-time between
presentations, but also keeps the cost of recovering an order from
becoming financially prohibitive.
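
The idea of updating only the elements touched by the latest judgment can be sketched as follows. This is not the IRIS code or the O'Neil and O'Neil procedure itself, only a minimal illustration under the assumption that the dominance matrix is kept transitively closed after every response; the function name is hypothetical.

```python
import numpy as np

def record_judgment(D, winner, loser):
    """Add 'winner dominates loser' to a transitively closed Boolean matrix.

    D[i, j] is True when i is known (observed or implied) to dominate j.
    Only the column of `winner` and the row of `loser` are consulted.
    """
    above = D[:, winner].copy()   # everything already dominating the winner
    above[winner] = True
    below = D[loser, :].copy()    # everything already dominated by the loser
    below[loser] = True
    # Every stimulus in `above` now dominates every stimulus in `below`.
    D |= above[:, None] & below[None, :]
    return D

D = np.zeros((4, 4), dtype=bool)
record_judgment(D, 0, 1)
record_judgment(D, 2, 3)
record_judgment(D, 1, 2)          # links the two chains
print(D.astype(int))
```

After the third judgment the pairs (0, 2), (0, 3), and (1, 3) are filled in by implication, so none of them would need to be presented.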


Program Features


Input 


Program input is from a remote terminal. The input includes subject
identification, the number of stimuli, and the stimuli names. Available
input options include (a) recovering a user-specified number of randomly
selected pairs, from which a validation index is calculated on the basis
of the final order, and (b) specification of the initial pairs, instead of the
default random assignment. Also included in IRIS is the possibility of
allowing a no-preference response to a given pair, which is analogous
to a tie or subject's indifference.


Output 


Output is only available at the terminal. The output includes the 
recovered order of stimuli with their assigned ranks, a measure of 
predictive validity of the final order, and other pertinent run statistics, 
e.g., the number of judgments. Also available to be called out by the 
user are any of the matrices or vectors used in calculations including a 
summary in matrix form of all preference judgments made and their 
presentation number. 


Limitations 


A maximum of about 50 stimuli can be handled, depending upon the
options utilized.


Computer and Program Language 


Written in APL, IRIS is currently implemented at an IBM 370/158 
facility. 

A copy of the program and sample output may be obtained from
Norman Cliff.


INTERORD: A COMPUTER-INTERACTIVE
FORTRAN IV PROGRAM FOR
DEVELOPING SIMPLE ORDERS


JERARD F. KEHOE AND NORMAN CLIFF
University of Southern California 


INTERORD, a computer-interactive FORTRAN IV program for
developing simple orders on a set of objects, is described.
INTERORD executes in the interactive mode by presenting pairs to
be judged at a remote terminal. The respondent at the terminal
judges which one of the objects dominates the other. INTERORD
utilizes the observed dominance relations to determine which additional
relations are implied by the transitivity principle. Since only pairs
not implied are presented for judgment, a substantial reduction occurs
in the number of judgments required in the pair comparisons
procedure.


IN many situations investigators require the determination of a simple
order on a set of objects. Examples include such tasks as having a
subject order choice alternatives according to his or her preferences,
having a personnel worker order job elements in terms of importance,
or even having an individual order the cells of a factorial design with
respect to a dependent variable of interest. Theoretically sound
measurement procedures such as comparative judgment methods
frequently require a very large number of judgments, many of which may
seem redundant with respect to previous judgments.


Purpose 


The purpose of this paper is to describe principal features of a 
computer-interactive FORTRAN IV program for developing simple
orders (INTERORD). Reynolds and Cliff (1975) have described an
APL version of the interactive ordering procedure. The objective of 
INTERORD is to determine a simple order for an individual respon- 
dent from a minimum number of paired-comparison dominance judg- 
ments. 

INTERORD operates in an interactive mode by presenting pairs of 
objects and by collecting dominance judgments from a respondent 
who is utilizing a remote control terminal. The interactive feature is re- 
quired in order that the program may determine after each judgment 
which of the judgments not yet observed are redundant with respect to 
the previously observed judgments. INTERORD presents to the 
respondent at the terminal only pairs which are not redundant. 


Rationale 


The determination of redundant pairs is based on the principle of
transitivity of dominance judgments, which is fundamental to the
construction of a simple order. The object pair (x, z) is redundant when it
is implied by the transitivity principle. As soon as x dominates any
object y and y dominates z, then x is assumed by transitivity to dominate
z. INTERORD does not present such implied pairs for judgment; rather,
it assumes that the implied judgment for that pair exists and
presents only those pairs for which no judgment is implied.
The algorithm for determining these transitive implications is based on
matrix multiplication utilizing Boolean arithmetic (Cliff, 1975).
A simple order is completely determined when an order of objects
can be found in which each pair of consecutive objects is connected
with an observed dominance relation. When such an order exists, all
other pair relations must be implied by transitivity. It is not necessary
to present them for judgment. INTERORD relies on this principle in
determining the simple order for the respondent. This principle of
connectedness is derived from a graph theoretic approach to relations
among objects (Harary, Norman, and Cartwright, 1965). The number
of judgments required to define a complete simple order has a
theoretical minimum equal to log2 N!, where N is the number of objects.
It has been found in practice that this number (which is 62 for N = 20
and 215 for N = 50) is a very close approximation to the actual number of
judgments required by INTERORD.

The program also gathers a number (user-specified) of extra judgments
which are stored separately. After the interactive session, these
judgments are used to provide validation data for the final order.


The Program


Input 


Program input is from the remote terminal. The input includes 
respondent identification, the number of objects, the object names, an 
advantageous starting order if appropriate, the number of randomly 
selected extra judgments, and some output options. An option exists 
for the object names to be read from a previously created data set in- 
stead of from the terminal. 


Output 


There are three output modes. Output available at the terminal in- 
cludes run statistics, object names, and the starting and final order. 
Printed output consists of run statistics, a measure of the predictive 
validity of the final order (utilizing the random extra judgments), the 
judgments, object names, the starting and final orders, and an optional 
matrix of responses. Punched card output is similar to the terminal 
output. Printed and punched output are created by an accompanying 
but separate program which executes in batch mode. INTERORD 
writes the run data onto a user-created data set which the output 
program in turn reads. 


Limitations 


Maxima of 50 objects, 240 judgments, and 30 random extra judg- 
ments are allowed. 


Computer and Program Language 


INTERORD and the output program are written in FORTRAN IV
(H compiler) for the IBM 370/158. It should be noted that 66K is
required for INTERORD and that 50K is needed for the output
program.

A copy of the program and sample output may be obtained from
Norman Cliff, Department of Psychology, University of Southern
California, Los Angeles, California 90007.


MAPPING INDIVIDUAL LOGICAL PROCESSES


FREDERICK O. SMETANA


North Carolina Science and Technology Research Center
Research Triangle Park, N. C.


A technique to measure and describe concisely a certain class of in- 
dividual mental reasoning processes has been developed. The 
measurement is achieved by recording the complete dialog between a 
large, varied computerized information system with a broad range of 
logical operations and options and a human information seeker. A 
type of flow chart, familiar to computer programmers, is used to 
delineate this dialog in a hierarchical fashion. An example of such a 
chart is given in the paper. Results obtained on a limited number of 
investigations suggest the technique may be useful for studying 
various aspects of mental performance. 


CERTAIN individuals display an unusual ability to find specific infor- 
mation quickly in large collections such as libraries. Perhaps if one un- 
derstood sufficiently well the complex logical process by which these 
individuals operate, one might quantify the processes somewhat in the 
form of a logical “map.” Then, by obtaining and analyzing a 
statistically significant number of such logical mappings, one may be
able to deduce certain universal laws for effective information transfer. 

The prospect of being able to do this task and the knowledge of its
potential impact on education led, upon the acquisition of suitable
measuring equipment, to the design of a brief experiment whose
results are reported in this paper. The experiment is described in detail
in Smetana (1975).

Briefly, the experiment recorded the complete dialog between a
computerized information retrieval system and a human information
seeker. Since the information collection is large and varied (7 X 10°
citations in 34 broad subject categories) and since the range of logical
operations and options which one can request that the system perform
is quite varied, the information seeker can permit his imagination and
experience almost limitless freedom in retrieving answers to specific 
questions. In other words, the individual's thought processes are not 
significantly constrained by the system. By recording the instructions 
given by the information seeker to the machine, its response, and his 
reaction to that response, it is possible to gain considerable insight into 
what the seeker is actually thinking during the search process— 
literally how his mind works in solving problems of this type. This in- 
sight can be translated into a formal flow chart which depicts in sym- 
bolic shorthand the chronological and hierarchical relationships 
between concepts, responses, and instructions. 

On the basis of some limited but successful experience with these 
recording and charting techniques it is suggested that they may be 
useful to other investigators attempting to measure and to characterize 
various aspects of mental performance. 


Recording Device 


Instructions to the computer are entered by means of typewriter
keyboard. The instructions and the system response are displayed on a
TV screen placed behind the keyboard. To one side of the keyboard is
a printer which records everything displayed on the TV screen. These
typed records then are the raw material from which the flow charts are
constructed.


Flow Charts 


Construction of the flow charts (an example of a section of a com- 
plex flow chart is shown in Figure 1) requires the exercise of some 
judgment on the part of the drafter, particularly concerning the 
antecedents or sources of certain concepts. Thus there may be some 
variability in assigning hierarchical relationships. But if one has a
reasonable familiarity with the subject area, this variability is usually 
not significant; in fact, it will usually be found that flow charts drawn 
by various individuals with an understanding of the subject area and 
of the approach to be used in constructing the charts seldom differ. 

As an example of this lack of any practical difference in basic
perception, one may mention that the 10 flow charts assembled to date
were drafted by three different individuals. Yet when one individual
examined the flow charts drafted by the other individuals and
compared them with the dialog recording, he seldom found an outright
error and, more importantly, seldom could offer a suggestion for
improving the graphical representations. This circumstance indicates
that the process is reasonably independent of the drafter.


Concluding Remarks


Examining a flow chart in detail is a fascinating experience in retrac- 
ing someone's thought process. It becomes even more fascinating 
when one examines how several individuals have attacked the same 
problem. One soon discovers that, for complex problems at least, it is 
very difficult to determine (even after the fact) what was the optimal 
strategy for solution of the problem. That several completely different 
yet seemingly equally logical approaches have achieved comparable 
results indicates the need to analyze a large number of solutions for 
statistical significance in order to detect which approach or class of ap- 
proaches has achieved superior results for a particular problem. The 
technique of recording the thought process dialog in detail and of ex- 
pressing it in the form of a flow chart seems particularly well suited to 
characterizing the human thought process in a formal manner so that 
it may serve as an ideal raw material for further analysis.


ITANA—III: A FORTRAN IV PROGRAM FOR MULTIPLE-CHOICE
TESTS AND ITEM ANALYSIS


BARUKH NEVO, ELI SHOR, AND RACHEL RAMRAZ 
University of Haifa, Israel 


A two-phase FORTRAN IV program called ITANA—III for an
IBM 1130 computer is described that permits computation of
psychometric characteristics of multiple-choice examinations including
test statistics (phase I) and item statistics (phase II). Consisting of
280 statements, the program can handle up to 200 items with not
more than 9 alternatives from samples of examinees not exceeding
2000.


THE purpose of this paper is to describe a two-phase computer
program (ITANA—III) that was designed to calculate various
psychometric characteristics of (a) multiple-choice tests (phase I giving
test statistics) and (b) individual items (phase II providing item
statistics).

The results of the analysis of scores on a multiple-choice test carried
out in phase I may be used in applying and interpreting the results of
item analyses carried out in phase II. Written in FORTRAN IV, the
program is for an IBM 1130 computer.


Program Input and Output 


The input consists of the answers of N subjects to a multiple-choice 
test with m items, each with k mutually exclusive alternatives. 

The output of phase I consists of (a) individual raw scores, (b) mean
and standard deviation, (c) frequency distribution (table, histogram),
(d) chi-square test for testing the normality of the frequency distribution,
(e) split-half reliability coefficient between scores on odd and
even items, (f) Kuder-Richardson reliability coefficient (formula 20),
and (g) product-moment correlation with one or more external criteria
(if supplied).

The output of phase II consists of (а) point biserial correlation over 
subjects between individual items and raw scores, (b) point biserial 
correlation over subjects between individual items and one or more ex- 
ternal criteria (if supplied), (c) difficulty-levels of individual items (n/N 
or n/A, where n indicates the number of subjects who answer correctly 
this item, N indicates the total number of subjects, and A denotes the 
number of subjects attempting to answer this item), and (d) proportion. 
of answers for each of the К alternatives. 
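The phase II item statistics can be illustrated with a short sketch. The Python below is not the ITAMA program; it is a minimal illustration, assuming a 0/1 scored matrix and the textbook point biserial formula, with the item left in the total score. The name item_statistics is hypothetical.

```python
from statistics import mean, pstdev


def item_statistics(scored):
    """Return (difficulty, point_biserial) for each item.

    scored[s][i] is 1 if subject s answered item i correctly, else 0.
    Difficulty is n/N; the point biserial is computed against the raw
    total score (item included), using the population SD of the totals.
    """
    totals = [sum(row) for row in scored]
    sd_total = pstdev(totals)
    stats = []
    for i in range(len(scored[0])):
        right = [t for row, t in zip(scored, totals) if row[i] == 1]
        wrong = [t for row, t in zip(scored, totals) if row[i] == 0]
        p = len(right) / len(scored)
        if right and wrong and sd_total > 0:
            r_pb = (mean(right) - mean(wrong)) / sd_total * (p * (1 - p)) ** 0.5
        else:
            r_pb = float("nan")  # undefined when everyone passes or fails
        stats.append((p, r_pb))
    return stats


# Four subjects, three items of increasing difficulty.
print(item_statistics([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]))
```

The n/A variant of the difficulty index would need a separate marker for omitted items, which this sketch does not model.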


Program Capacity and Limitations 


The program consists of 230 statements. Limitations are: N ≤ 2000;
m ≤ 200; k ≤ 9.


Availability 


A listing of FORTRAN statements, detailed directions for use, and
example input and output may be obtained from Barukh Nevo,
Department of Psychology, University of Haifa, Mt. Carmel, Haifa,
Israel.


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT. 
1975, 35, 685-687. 


A PROGRAM SYSTEM FOR THE ESTIMATION OF
CHARACTERISTICS OF THE TEST SCORE
DISTRIBUTION RESULTING FROM TEST
ITEMS WITH GIVEN STATISTICS


LEE L. SCHROEDER 
Burlington County College 


This paper describes a program system which was developed for 
the purpose of making estimates about the nature of the distribution 
of test scores which will result from an administration of a test com- 
posed of items on which estimates of item-difficulty and discrimina- 
tion are available. In developing these estimates, the test constructor 
is free to make assumptions about the nature of the group of ex- 
aminees and the number of examinees to be tested. It is expected that 
test developers will find this program system useful in the test 
development process. 


AN integral step in the test development process is item pretesting.
After test items are written, critiqued and revised, it is common to
assemble preliminary tests composed of these items and to administer
them to groups of subjects representative of the population for which
the test is designed. Subsequently, the items are statistically analyzed
to uncover any flaws in the items which were not formerly apparent.
Items surviving this screening then become part of the pool from
which items will be drawn to assemble the final test.

Many standardized tests are revised annually. The annual test
versions are assembled based on item statistics determined from item
pretesting as just described. Whenever tests are assembled from item
pools in this manner, one problem is common for all test developers.
That is, since the items on a new version of an examination were not
necessarily tested together, it is necessary to develop methods of
estimating the nature of the distributions of scores which will result from
the administration of the test version. The only parameter of a score
distribution which can be estimated directly is the mean. The familiar
formula


$$\bar{X} = \sum_{i=1}^{n} p_i$$


indicates how the test mean may be estimated from prior knowledge
of the difficulty levels of the items. In this formula, $\bar{X}$ is the test mean,
$n$ is the number of items on the test, and $p_i$ is the difficulty level, or
proportion correct, for the $i$th item. Several other parameters of the
score distribution, however, are of interest. Among these parameters
are the standard deviation of scores and the estimated test reliability
(KR-20).


Overview of the Computer Programs 


To facilitate the review of expected sampling distributions of score 
statistics based on prior knowledge of the item-difficulty level and 
item-test biserial correlation for each item in an item set, the Test 
Simulation System (TSS) was developed. TSS was developed in the 
BASIC language to operate in a time-sharing mode. The system con- 
sists of five program segments and of several data files. Only one 
program is required to be in core at any moment so as to minimize
core requirements.

Using TSS, the test constructor may specify the number of
subjects to be included in the simulation as well as the number of
complete test simulations to be made. Since the simulation scheme is based
on the latent trait model (Baker, 1965), the mean and standard
deviation of the trait distribution are specified by the user. (A mean of zero
and standard deviation of one are assumed, as in standard score form.)
Last, the item difficulty levels (p) and item-test biserial correlations (r)
are entered as prompted by the computer.

Subsequently, the system completes as many test replications
as indicated for as many subjects as indicated. For each item, p and r
values are converted into parameters of the adjusted cumulative
logistic function (Lord and Novick, 1967) and, for each subject, a
probability of scoring correctly on the item is generated. Based on
these probabilities, all subjects are scored on each item. The output of
this process is a test matrix. This test matrix is then analyzed by
computing the twenty-two statistics in Table 1. These twenty-two statistics
are stored, for each replication, in a disk file. After all replications
have been performed, the final program segment produces histograms
and summary statistics for each of the twenty-two statistics. From
these histograms, or expected sampling distributions, for each statistic,
inferences may be drawn regarding the nature of the score distribution
which would result if the items perform as anticipated.


TABLE 1
Twenty-Two Test and Item Statistics Associated with Test Matrix

                                                 Statistics
                                                Standard
Sample                                  Mean    Deviation   Skewness   Kurtosis

Test scores                              1*         2           3          4
Item-Difficulty Levels                   5          6           7          8
ETS Delta                                9         10          11         12
Item-Test Point-Biserial Correlation    13         14          15         16
Item-Test Biserial Correlation          17         18          19         20

21  Kuder-Richardson 20
22  Kuder-Richardson 20 adjusted by the Spearman-Brown formula
    to a 100-item test.

* The number in the table is the number of the variable in the system output.
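One replication of the simulation scheme described above can be sketched as follows. This is not the TSS code, which was written in BASIC and is not reproduced in the paper. It is a minimal Python sketch that assumes the textbook normal-ogive conversions (discrimination a = r / sqrt(1 - r^2), location b = -z(p) / r for a standard normal trait) and a logistic curve with the usual scaling constant 1.7. The names item_params, simulate_test, and kr20 are hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, pvariance


def item_params(p, r_bis):
    """Convert difficulty p and biserial r into logistic (a, b) parameters."""
    a = r_bis / math.sqrt(1.0 - r_bis ** 2)
    b = -NormalDist().inv_cdf(p) / r_bis
    return a, b


def simulate_test(p_values, r_values, n_subjects, rng,
                  trait_mean=0.0, trait_sd=1.0):
    """Return a 0/1 test matrix with one row per simulated subject."""
    params = [item_params(p, r) for p, r in zip(p_values, r_values)]
    matrix = []
    for _ in range(n_subjects):
        theta = rng.gauss(trait_mean, trait_sd)
        row = []
        for a, b in params:
            prob = 1.0 / (1.0 + math.exp(-1.7 * a * (theta - b)))
            row.append(1 if rng.random() < prob else 0)
        matrix.append(row)
    return matrix


def kr20(matrix):
    """Kuder-Richardson formula 20 for a 0/1 test matrix."""
    k = len(matrix[0])
    var_total = pvariance([sum(row) for row in matrix])
    sum_pq = 0.0
    for j in range(k):
        pj = mean(row[j] for row in matrix)
        sum_pq += pj * (1.0 - pj)
    return k / (k - 1) * (1.0 - sum_pq / var_total)


rng = random.Random(1975)
matrix = simulate_test([0.3, 0.5, 0.7, 0.8, 0.6],
                       [0.5, 0.6, 0.4, 0.5, 0.6], 2000, rng)
print(sum(map(sum, matrix)) / len(matrix), kr20(matrix))
```

Repeating the call for many replications and collecting each statistic would give the expected sampling distributions that the final program segment summarizes.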


Discussion 


When applied to a post-hoc analysis of a teacher-made test, TSS was
shown to provide exceptionally accurate estimates of parameters
which were observed in the test data (Schroeder, 1975). Furthermore,
Schroeder (1975) has shown the system to be valid in that known
relationships between test item and test score statistics based on
empirical research are found to exist in data generated from TSS.


Availability 


Program listings and a paper describing the use of TSS in research 
and test construction activities may be obtained from Dr. Lee L. 
Schroeder, Director of Measurement and Evaluation, Burlington 
County College, Pemberton, New Jersey 08068.


A COMPUTER PROGRAM TO CALCULATE ADJUSTED
AND UNADJUSTED INTERRATER RELIABILITIES 
FOR SETS AND SUBSETS OF JUDGES 


JOHN F. GREENE 
University of Bridgeport 


WILLIAM M. McCOOK 
University of Connecticut 


FRANCIS X. ARCHAMBAULT 
Abt Associates 


A computer program to be used to assess interrater reliabilities has 
been written. Given judges’ ratings on a set of variables pertaining to 
subjects or events, the program will produce for each variable, a 
printed analysis of variance summary table for between subjects/ 
events, within subjects/events, and between judges sources of 
variability, a reliability coefficient, an adjusted reliability coefficient 
and means and standard deviations for each rater. Further, analysis 
of variance and reliability output for subsets of judges is generated. 


WITH many psychometric instruments, particularly those that elicit
responses from open-ended questions, the judgment of the scorer is a
critical element. Examples of such instruments include tests of
creativity and projective tests of personality. It has been noted that
with these types of instruments there is as much a need for an estimate
of scorer reliability as there is for more conventional reliability
coefficients (Anastasi, 1968, p. 86).

In response to the need to assess between-judge variability and
subsequently interscorer reliability, a cycling type of computer program
has been written. The program generates a reliability estimate of the
pooled ratings of the judges using analysis of variance techniques (see
Winer, 1962, pp. 124-132) for all judges and for combinations of
subsets of judges. At present, the program cycles through all
combinations of one less than the total number of judges in addition to the
total set of judges. The user may find this procedure useful if he needs
to determine whether any one of the judges' ratings should be deleted
because of systematic variation in that judge's scores that can be
related to differences in that judge's training, experience, or frame of
reference (see Greene, 1970, p. 26). The program also generates
adjusted reliabilities, generally higher than the original reliabilities,
which have empirically eliminated the effect of differences in judges'
means. These reliability estimates should be utilized when the
investigator is not willing to accept the assumption of mean
homogeneity (Ebel, 1951). Winer (1962, p. 128) refers to this
procedure as adjusting for differences in frame of reference.
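The two coefficients and the leave-one-judge-out cycle can be sketched as follows. This is not the PL/1 program; it is a minimal Python illustration that assumes the standard analysis of variance estimates for the reliability of pooled ratings: 1 - MS(within subjects) / MS(between subjects) for the unadjusted coefficient, and 1 - MS(residual) / MS(between subjects) for the coefficient adjusted for differences in judges' means. The function names are hypothetical.

```python
def interrater_reliability(ratings):
    """Return (unadjusted, adjusted) reliability of the pooled ratings.

    ratings[s][j] is the rating given to subject s by judge j.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]
    judge_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_judges = n * sum((m - grand) ** 2 for m in judge_means)
    ss_within = ss_total - ss_subjects
    ss_residual = ss_within - ss_judges

    ms_subjects = ss_subjects / (n - 1)
    ms_within = ss_within / (n * (k - 1))
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1.0 - ms_within / ms_subjects, 1.0 - ms_residual / ms_subjects


def leave_one_judge_out(ratings):
    """Reliabilities for all judges (key None) and with each judge dropped."""
    results = {None: interrater_reliability(ratings)}
    for dropped in range(len(ratings[0])):
        subset = [[x for j, x in enumerate(row) if j != dropped]
                  for row in ratings]
        results[dropped] = interrater_reliability(subset)
    return results


# Three judges who agree on the ordering but differ by a constant offset.
ratings = [[s + offset for offset in (0, 1, 2)] for s in (1, 2, 3, 4)]
print(leave_one_judge_out(ratings))
```

In this example the judges differ only in their means, so the adjusted coefficient is 1.0 while the unadjusted coefficient is 0.8, which is the pattern the adjustment is meant to expose.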


Input 


Input for the program consists of the following parts: 

1. A control card describing the number of variables that were 
rated, the number of judges, and the number of subjects. 

2. A data matrix of each judge’s ratings for each subject on each
variable.


Data are read in for each subject, one judge at a time, for all of the 
variables under consideration. 


Output 


The program yields the following outputs for each variable: 

1. An analysis of variance summary table which includes sources
of variation for between subjects/events, within subjects/events,
between raters, residual and total, as well as corresponding
degrees of freedom, sums of squares, and mean squares.

2. An estimate of the interrater reliability based upon the pooled
ratings of the judges.

3. An adjusted reliability coefficient based upon elimination of the
effect of differences in rater means.

4. A mean and standard deviation for each of the judges’ ratings.

5. Separate analysis of variance, interjudge reliabilities, and
adjusted reliabilities for each subset of judges. A judge code
parameter indicates which judge was not considered in the
particular analysis. Judge code 1 indicates that all judges were
considered in the analysis.

6. A summary table of reliabilities and adjusted reliabilities for each
variable and each judge code.


Capabilities and Limitations 


The program is written in PL/1. A run with 5 judges, 4 variables,
and 137 subjects required approximately 21 seconds of central
processing unit time and 105K of storage on an IBM 360, Model 65
computer running under OS. The program is presently set up to handle
a maximum of 5 judges, 200 subjects, and 23 variables but is easily
modified to handle larger data sets and different variations of subsets
of judges.


Availability 


A listing of the program, sample problem, and documentation is 
available from Dr. William M. McCook, University of Connecticut, 
School of Pharmacy, Box U-92, Storrs, Connecticut 06268.


A PROGRAM FOR THE T-SCORE NORMAL
STANDARDIZING TRANSFORMATION 


RONALD C. WIMBERLEY 
North Carolina State University 


This FORTRAN program transforms raw values of a variable 
into T-scores having a normal distribution with a mean of 50 and a 
standard deviation of 10. Options are available for the categorization 
of data, the assignment of missing data, and the merging of old input 
data records with new output records which contain T-scores of the 
variables for each case. New records alone or new plus old record 
output may be placed on cards, tape, or disk. 


THIS article describes a program for the T-score technique of normal
standardization. T-scores transform a raw score distribution,
regardless of its skewness or kurtosis, into a normal distribution with a
mean of 50 and a standard deviation of 10. First, this technique places
a set of input raw scores into a cumulative distribution. Then, the raw
scores are assigned to the Z-scores which correspond to the
Y-ordinates of either the means or the midpoints of the original raw
score categories. These Z-scores are next transformed into T-scores
ranging from 00 to 99 by the formula T = 10Z + 50. One discussion
of this technique is given by Walker and Lev (1958, pp. 192-201). This
technique is not to be confused with other so-called T-normalizing
approaches which calculate the underlying Z-scores directly from a raw
score distribution without adjusting them to their proper proportions
under the normal curve.
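The midpoint form of this technique can be sketched in a few lines. The Python below is not the NORMSTAN program; it is a minimal illustration, assuming each raw score category is placed at the midpoint of its span in the cumulative distribution and that T-scores are rounded to two digits. The name t_scores_midpoint is hypothetical.

```python
from collections import Counter
from statistics import NormalDist


def t_scores_midpoint(raw_scores):
    """Map each distinct raw score to a normalized T-score.

    Each category is placed at the midpoint of its cumulative
    proportion, converted to a Z-score through the inverse normal
    distribution, and rescaled with T = 10Z + 50.
    """
    n = len(raw_scores)
    counts = Counter(raw_scores)
    below = 0
    table = {}
    for value in sorted(counts):
        freq = counts[value]
        proportion = (below + 0.5 * freq) / n
        z = NormalDist().inv_cdf(proportion)
        table[value] = round(10 * z + 50)
        below += freq
    return table


print(t_scores_midpoint([1, 2, 2, 3]))  # {1: 38, 2: 50, 3: 62}
```

The mean technique described later for variables with few categories would replace the midpoint proportion with the mean normal deviate of the category, which this sketch does not implement.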


———— 

1 This program was partially supported by the North Carolina State University
Faculty Research and Professional Development Fund as well as the North Carolina
State University Agriculture Experimental Station Regional Project NE-89.
Appreciation is expressed to Edward Cureton for showing the writer the merits of the
T-score transformation and to Alen Baker for programming assistance.


Rationale


This T-score program is useful for approximating the normality as- 
sumptions of linear parametric statistics when ordinal data are used. 
An inherently linear relationship among the T-scores of different 
variables is free of mismatched kurtoses, skewnesses, and standard 
deviations which attenuate correlations or which lead to artificial non- 
linearities in regressions. Furthermore, the T-score transformation 
should generally result in a more nearly normal distribution than that 
provided by other transformations such as those from logarithms, ex- 
ponents, or roots. 


Description and Input Capacities 


The program which is written in FORTRAN IV, H-level, is for use 
on an IBM 370/165 system. Input may be from cards, tape, or disk. 
The program comes in two versions. The first, REGLR NORMSTAN, 
is for as many as 120 variables which have as many as 120 values 
apiece. The other, SUPER NORMSTAN, is for 10 variables, any one 
of which may have up to 1000 values. Since calculation of T-scores 
from a large number of values is quite time consuming, the program 
allows variables to be categorized. Forty-one or fewer categories may 
be specified by the program user. 


Computation Strategy 


Raw variables in either REGLR or SUPER NORMSTAN which
have six or fewer categories are converted to T-scores by the mean
technique. This subroutine places each raw score value at the
Y-ordinate for the mean of a category's cumulative proportion under the
normal curve. For variables with seven or more categories, the
midpoint technique is used to place each raw score at the median rather
than the mean of its category in the cumulative proportions. The
mean technique provides a better approximation to the normal curve
than the midpoint technique when there is a small number of raw
score categories.

Missing data may be excluded from the calculation of T-scores by
reassigning missing data codes to another two-digit value. If a raw
score variable has several missing data categories, such
as "not applicable" and "no answer," these may be given new
scores, such as 98 and 99 respectively, or both may be awarded the
same score. These extreme reassignment values are unlikely to occur in
an actual set of T-scores, since they are nearly five standard deviations
from the mean. If, instead, it is desired to reassign missing data to the
mean of the T-score distribution, the designated value would be 50.


Output


Output of the T-scores may be merely a printout or a printout plus
either card, tape, or disk output of new data records containing case
identification numbers, record numbers, and a series of two-digit
T-scores for the transformed variables. Identification numbers are
copied from input data records. The record number may be
incremented from the number of input data records or specified by the
user. The T-scores begin in column 13. Should there be more than 34
variables, their T-scores are continued to another record bearing the
same case identification and an incremented record number. In
addition, the new record(s) for each case can be produced by themselves
or together with reproductions of the input case records on cards, tape,
or disk.


Availability


FORTRAN ON THE IBM 360 COMPUTER


NATHAN JASPEN
New York University


A method is presented for distinguishing between blank fields and
zeros in FORTRAN programs written for the 360 Computer. The
method utilizes the T notation.


WHEN scores are missing from some subjects in a score data matrix,
the researcher must be careful that the computer does not treat the
missing data, represented by blanks in the card deck, as zeros. While
blanks and zeros may be treated alike to obtain a sum, they must be
distinguished from each other if means, standard deviations, or other
summary statistics are computed. This paper presents a method for
distinguishing blank fields from zeros.

Neither the integer nor the floating point notation in FORTRAN
distinguishes between a blank field and a zero. The alphabetic format
does distinguish, but of course alphabetic characters are not
quantities, and arithmetic operations cannot be performed upon them.
Thus, if the quantities 123 and 246 are read from a data card subject to
the format expression (2I3), they can be added together, but if they are
read subject to the format expression (2A3), they cannot be added
together.

The T format code specification (T for tabulate) was developed
primarily for the printing of column headings, but it may also be used
on input to skip from one column to another. The skipping can be
backwards as well as forwards; hence the programmer can attain the
same effect that a REREAD statement has on other computers. For
instance, the format expression (80I1, T1, 80A1) reads all 80 columns
in the data card twice, first in the integer mode and then in the
alphabetic mode.

Copyright © 1975 by Frederic Kuder

In the appended program, the critical statement is the format
statement labelled 2. The T notation causes the scores, which have already
been read as alphabetic fields, to be re-read as floating point
quantities.

Five score cards, shown in Appendix II, furnish the input for
this program. In this program a PRINT statement follows the
READ statement; hence all input cards are printed. After the last
card is read, the three sets of computations (the sums, the N's, and the
averages) are printed. Each N includes zero scores, but excludes
blanks.
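The same idea, reading each field once as text and once as a number, can be sketched in a modern language. The Python below is not a translation of the appended FORTRAN program; it is a minimal illustration that assumes the card layout implied by format statement 2 (an identification field in the first eight columns, then five four-column score fields beginning in column 9). The name column_stats and the sample cards are hypothetical.

```python
def column_stats(cards, start=8, width=4, n_fields=5):
    """Return (sums, counts, means) per field, skipping blank fields.

    A field made up only of blanks is treated as missing; a field
    containing 0 is counted as a legitimate score of zero.
    """
    sums = [0.0] * n_fields
    counts = [0] * n_fields
    for card in cards:
        card = card.ljust(start + width * n_fields)
        for i in range(n_fields):
            field = card[start + i * width:start + (i + 1) * width]
            if field.strip() == "":      # the "alphabetic" reading: blank
                continue
            counts[i] += 1               # the numeric reading of the field
            sums[i] += float(field)
    means = [s / c if c else 0.0 for s, c in zip(sums, counts)]
    return sums, counts, means


cards = [
    "   11111" + "   1" + "   2",
    "   11112" + "    " + "   0",   # first score blank, second score zero
]
print(column_stats(cards))
```

Here the blank in the first field of the second card is left out of that field's N, while the zero in the second field is counted.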


Limitations 


A restriction of this method of expressing data in more than one
mode is that the characters must be legal in each mode. For example, it
is not legal to use the FORMAT expression (A3, T1, F3.1, T1, I3) to
read the quantity 1.2, because the quantity 1.2 is not an integer and
cannot be expressed in the I notation.


Appendix I
A FORTRAN Program that Distinguishes Zeros from Blanks


C     TO DISTINGUISH BLANKS FROM ZEROS
C     N JASPEN MARCH 1974
      DIMENSION A(5), X(5), S(5), SUM(5), AV(5)
      DATA BLANK /'    '/
    1 FORMAT (A1)
    2 FORMAT (2X, I6, 5A4, T9, 5F4.0)
    3 FORMAT (2X, 'SUM', 3X, 5F4.1)
    4 FORMAT (2X, 'N', 3X, 5F4.1)
    5 FORMAT (2X, 'AV', 3X, 5F4.1)
      DO 50 I = 1, 5
      S(I) = 0
      SUM(I) = 0
      AV(I) = 0
   50 CONTINUE
  100 READ (5, 2, END = 300) ID, A, X
      PRINT 2, ID, A
      DO 200 I = 1, 5
      IF (A(I) - BLANK) 150, 200, 150
  150 S(I) = S(I) + 1.0
      SUM(I) = SUM(I) + X(I)
  200 CONTINUE
      GO TO 100
  300 DO 400 I = 1, 5
      IF (S(I)) 400, 400, 350
  350 AV(I) = SUM(I)/S(I)
  400 CONTINUE
      PRINT 1, BLANK
      PRINT 3, SUM
      PRINT 4, S
      PRINT 5, AV
      CALL EXIT
      END
Appendix II
Printed Output of the Program in Appendix I
11111 1 2 3 2
11112 2 3 0
11113 2 4 1
11114 6 1
11115 0 1 0
SUM 9.0 8.0 9.0 2.0 0.0
N 3.0 4.0 5.0 2.0 1.0
AV 3.0 2.0 1.8 1.0 0.0


THE CALCULATION OF CORRELATION MATRICES
USING SINGLE SUBSCRIPT NOTATION 


NATHAN JASPEN 
New York University 


A method is presented of calculating correlation matrices using 
single subscript rather than double subscript notation. This saves 
time and space, and permits the calculation of larger matrices in the 


space available. 


CORRELATION programs usually use double subscript notation, since 
the correlation matrix is two-dimensional. The purpose of this paper is 
to present a method of producing correlation matrices that employs
single subscripts. The single subscript method is superior to the double 
subscript method because it saves time and space, and permits the 
computation of large matrices. 


The Double Subscript Method 


Coding such as the following is typical for calculating sums of
cross-products:


      DO 500 I = 1, M
      DO 500 J = 1, M
  500 C(I, J) = C(I, J) + X(I) * X(J)


In this illustration, M is the number of variables, X represents the
variables, and C represents the cross-products. It is assumed that M
has already been defined for the particular problem and that the
cross-product matrix C has been properly zeroed. The matrix C is square,
consisting of M columns and M rows, and all cells are computed.

An alternate method of coding, probably more widely used, is the
following:


      DO 500 I = 1, M
      DO 500 J = 1, I
  500 C(I, J) = C(I, J) + X(I) * X(J)


This triangle method requires only about half as many calculations
as the square method, and therefore uses approximately half as much
time. The space requirements are, however, identical if
double-subscript notation is used. Furthermore, the maximum size
cross-product matrix that can be squeezed into any given number of storage
units is an M by M matrix, where M is the square root of the number
of storage units available for the cross-product matrix.


The Single Subscript Method


Either the square method or the triangle method can be executed
with single-subscript notation, but only the triangle method is of
interest here. The advantage of the single subscript notation is that it
saves not only time but also space. Alternatively, in a given amount of
space a larger matrix may be stored.

Consider the following 3 by 3 matrix. Not all the cross-products are
computed, but only C(1, 1), C(2, 1), C(2, 2), C(3, 1), C(3, 2), C(3, 3).
Now, let


      R(1) = C(1, 1)
      R(2) = C(2, 1)
      R(3) = C(2, 2)
      R(4) = C(3, 1)
      R(5) = C(3, 2)
      R(6) = C(3, 3)


and, in general, for square matrices of any order,


      R(K) = C(I, J).


In this statement, K = (I*(I - 1))/2 + J. K is always an integer,
since I and I - 1 are consecutive numbers, one of which must always
be even. If J exceeds I, these two index values must be reversed in the
above formula.


The coding is as follows: 


      DO 500 I = 1, M
      DO 500 J = 1, I
      K = (I*(I - 1))/2 + J
  500 R(K) = R(K) + X(I) * X(J)
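The index formula above can be checked with a short sketch. The Python below is an illustration rather than part of the author's program; it keeps the 1-based subscripts of the FORTRAN coding by leaving element 0 of the list unused, and the function names are hypothetical.

```python
def tri_index(i, j):
    """Single subscript K for cell (I, J) of a lower-triangular matrix.

    Subscripts are 1-based, as in the FORTRAN coding. If J exceeds I
    the two index values are reversed, since the matrix is symmetric.
    """
    if j > i:
        i, j = j, i
    return i * (i - 1) // 2 + j


def accumulate_cross_products(rows, m):
    """Sum X(I)*X(J) over all cases into a packed vector R(1..M(M+1)/2)."""
    r = [0.0] * (m * (m + 1) // 2 + 1)   # element 0 is unused
    for x in rows:
        for i in range(1, m + 1):
            for j in range(1, i + 1):
                r[tri_index(i, j)] += x[i - 1] * x[j - 1]
    return r


print([tri_index(i, j) for i in range(1, 4) for j in range(1, i + 1)])
print(accumulate_cross_products([(1.0, 2.0), (3.0, 4.0)], 2))
```

The first line of output reproduces the assignment of R(1) through R(6) given for the 3 by 3 case.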


Assuming that the sums corresponding to each variable have been 
stored in a column vector SX, and that the number of cases is S, then 
the following coding for the computation of means and standard 
deviations will appear somewhere in the program: 


      DO 600 I = 1, M
      K = (I*(I + 1))/2
      AV(I) = SX(I)/S
      VAR = R(K)/S - AV(I)**2
      SD(I) = 0
      IF (VAR) 600, 600, 580
  580 SD(I) = SQRT(VAR)
  600 CONTINUE
Following this routine, the correlation coefficients may be 
computed: 
      DO 700 I = 1, M
      DO 700 J = 1, I
      K = (I*(I - 1))/2 + J
      DEN = SD(I) * SD(J)
      IF (DEN) 690, 690, 680
  680 R(K) = (R(K)/S - AV(I) * AV(J))/DEN
      GO TO 700
  690 R(K) = 0
  700 CONTINUE
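The two routines above can be sketched together in Python. This is an illustration under the same definitions (packed sums of cross-products R, column sums SX, number of cases S), not the author's code. It writes the coefficients to a new list instead of overwriting R, so each cross-product is still available when its cell is processed; the function names are hypothetical.

```python
import math


def tri_index(i, j):
    """1-based single subscript for cell (I, J), reversing if J > I."""
    if j > i:
        i, j = j, i
    return i * (i - 1) // 2 + j


def correlations_from_sums(r, sx, s):
    """Return packed correlations from packed cross-product sums.

    r  : packed sums of cross-products, element 0 unused
    sx : list of column sums, one per variable
    s  : number of cases
    """
    m = len(sx)
    av = [total / s for total in sx]
    sd = []
    for i in range(1, m + 1):
        var = r[tri_index(i, i)] / s - av[i - 1] ** 2
        sd.append(math.sqrt(var) if var > 0 else 0.0)
    corr = [0.0] * len(r)
    for i in range(1, m + 1):
        for j in range(1, i + 1):
            den = sd[i - 1] * sd[j - 1]
            if den > 0:
                k = tri_index(i, j)
                corr[k] = (r[k] / s - av[i - 1] * av[j - 1]) / den
    return corr


# Two perfectly correlated variables: sums of squares and cross-products.
r = [0.0, 14.0, 28.0, 56.0]
print(correlations_from_sums(r, [6.0, 12.0], 3))
```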


The print-out routine must also provide for single-subscript
notation. Also, if the correlation matrix is very large, it will be
necessary to partition it both horizontally and vertically for printing. It is
generally convenient to print 16 columns and 48 rows of correlations
per page. The following routine will number the rows and columns,
partition the square matrix, and perform the printing:


      DO 800 II = 1, M, 16
      JJ = II - 1 + 16
      IF (M - JJ) 710, 720, 720
  710 JJ = M
  720 JK = JJ - II + 1
      DO 800 KK = 1, M, 48
      PRINT 19
      PRINT 23, (I, I = II, JJ)
      PRINT 18
      LL = KK - 1 + 48
      IF (M - LL) 730, 740, 740
  730 LL = M
  740 DO 800 I = KK, LL
      DO 770 J = II, JJ
      III = J - II + 1
      IF (I - J) 760, 750, 750
  750 K = (I*(I - 1))/2 + J
      GO TO 770
  760 K = (J*(J - 1))/2 + I
  770 Q(III) = R(K)
  800 PRINT 24, I, (Q(III), III = 1, JK)
   18 FORMAT ('0')
   19 FORMAT ('1')
   23 FORMAT (18X, 16(4X, I3))
   24 FORMAT (11X, I4, 3X, 16F7.3)
      DIMENSION Q(16)
      DIMENSION AV(220), SD(220), SX(220), X(220)
      DIMENSION R(24310)
It is a simple matter to expand this routine to include alphabetic
titles for the rows and columns.

The above problem is dimensioned for 220 variables. The dimension
for R equals (220 × 221)/2. This can be increased or decreased as
occasion demands and computer capacity permits.

Finally, it may be noted that while the assignment statements for K
add additional instructions to the program, the program with single
subscripts compiles and executes faster than a program with double
subscripts.


Effect of Single Subscripting on the Size of the Matrix 


As an illustration of the effect of single subscripting on the size of 
the maximum matrix, suppose that 24310 words are available for the 
cross-product matrix. If the double subscript notation is used, the 
maximum matrix that can be calculated is 155 by 155. If single sub- 
scripts are used, the maximum size matrix is increased to 220 by 220. 
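The two figures in this illustration follow from simple integer arithmetic, sketched below in Python. The function names are hypothetical. A square matrix of order M needs M * M words, while the packed triangle needs M * (M + 1) / 2 words.

```python
import math


def max_order_double(words):
    """Largest M such that an M by M matrix fits in the given words."""
    return math.isqrt(words)


def max_order_single(words):
    """Largest M such that M * (M + 1) / 2 packed cells fit in the words."""
    return (math.isqrt(8 * words + 1) - 1) // 2


print(max_order_double(24310))  # 155
print(max_order_single(24310))  # 220
```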


Availability 


The segments provided herein can be readily combined into the 
user's program to suit his requirements. However, a source listing,
sample problem and documentation are available at cost (two dollars) 
from the author at New York University, School of Education, 32 
Washington Place, New York, New York 10003. 


A COMPUTER PROGRAM TO CREATE A
POPULATION WITH ANY DESIRED CENTROID 
AND COVARIANCE MATRIX 


JOHN D. MORRIS 


Rehabilitation Research Institute 
University of Florida 


A computer program written in FORTRAN IV is presented which 
will create a population of desired size with marginally normal score 
vectors manifesting any desired centroid and covariance matrix. 
Uses and documentation are provided. 


THE most usual and traditional direction for the statistical analysis 
of data has been that of reduction. In consonance with scientific
parsimony, relatively large masses of data are quantitatively reduced to
much simpler indices. There may be, however, occasions when the 
researcher may wish to proceed in the reverse direction of expansion. 
More specifically, since almost all parametric properties (excluding 
higher order moments) of a multivariate data set are subsumed within 
the centroid (vectors of means) and covariance matrix (or equivalently 
in the correlation matrix and variances), occasions may arise when the 
methodologist and/or the practical behavioral science researcher may 
wish to create a population with a desired number of score vectors 
which are marginally normal and manifest a desired centroid and 
covariance matrix. The purpose of this paper is to describe a computer
program that will generate a population with these very
characteristics.


Three Occasions for Generation of a 
Population with Specific Characteristics 


Three occasions are outlined for which the creation of such a
population might be desired.


Instructional Uses


In the multivariate statistics or measurement classroom, the instruc- 
tor could generate data exhibiting whatever parametric characteristics 
he might wish and could assign these to groups of students for analysis 
and interpretation. These data sets could include problems of easy
interpretability (of the level of Cattell's "plasmode," 1966, p. 223), and
relatively difficult interpretive problems. Possibilities for
individualized instruction of students at different points along a
graduated interpretive difficulty continuum also appear plausible.


Need to Accept Correlation Matrix as Input Data 


Most researchers using multivariate techniques have encountered
the problems associated with a package of programs that will not
accept a correlation matrix as input data. For example, a
researcher might wish to use the missing data options of the Statistical
Package for the Social Sciences (Nie, Bent, and Hull, 1970) to create a
correlation matrix and then to complete a canonical
correlation analysis. Since SPSS does not have a canonical correlation
program and since the other most popular packages that do have a
canonical correlation program may not accept a correlation matrix as
raw data (BMD 09M, Dixon, 1973; Statistical Analysis System, Barr
and Goodnight, 1972), the researcher may be at least temporarily
stalled. The suggestion is that a sample manifesting the exact
covariance matrix and centroid of the sample of interest could be created and
these score vectors could be used as raw data. The results of such a run
would be identical to using the original score vectors as input. This
same problem would occur anytime a researcher wishes to do
statistical analyses from a correlation or covariance matrix from the
literature or elsewhere and finds that the common statistical packages
will not accept the reduced input (Kerlinger and Pedhazur, 1973).


Use in Monte Carlo Studies


The last use is in Monte Carlo studies in which the methodological
researcher wishes to draw samples from a population of score vectors
with a desired centroid and covariance matrix.


Related Efforts 


Collier, Baker, Mandeville, and Hayes (1967) described a method
for creating a population which manifests "approximately" a desired
covariance matrix. However, this method gives only an approximate
solution. There apparently has been no information provided
regarding the method's relative accuracy. Moreover, the degree to
which the covariance matrix of the created sample fits that desired is
contingent upon the size of the sample. Further literature search
produced no documentation of any other proposed methods or
computer programs.


Computational Procedure 


A population with the desired number of approximately random 
normal deviate score vectors is created by the Box and Muller (1958) 
method as modified by Marsaglia and Bray (1964). Each score vector 
has a number of components equal to the number of variables desired 
in the covariance matrix. Small intercorrelations between the random 
normal deviate variables are eliminated by triangularly decomposing 
(Guertin and Bailey, 1970) the random normal deviate intercorrelation 
matrix and by calculating factor scores which are thus standardized 
and independent. Each of these vectors is then premultiplied by the 
triangularly decomposed covariance matrix desired. The vector of 
desired means is then added to each score vector. This procedure 
results in the population with the desired number of score vectors with 
the desired centroid and covariance matrix, and by the central limit 
theorem the score vectors should be marginally normal. This nor- 
mality has been consistently empirically verified with tests of proper- 
ties of the resulting distributions (McNemar, 1962). 
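The steps of this procedure can be sketched in pure Python. This is not the author's FORTRAN IV program. It is a minimal illustration that assumes Python's built-in normal generator in place of the Box and Muller method as modified by Marsaglia and Bray, Gram-Schmidt orthogonalization as the equivalent of triangularly decomposing the deviates' intercorrelation matrix, and variances computed with divisor N. All function names are hypothetical.

```python
import math
import random


def cholesky(a):
    """Lower-triangular factor T with T * T' equal to the matrix a."""
    n = len(a)
    t = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(t[i][k] * t[j][k] for k in range(j))
            t[i][j] = math.sqrt(s) if i == j else s / t[j][j]
    return t


def orthonormal_deviates(n_cases, n_vars, rng):
    """Columns of normal deviates with exactly zero means, unit
    variances, and zero intercorrelations in the created population."""
    cols = []
    for _ in range(n_vars):
        col = [rng.gauss(0.0, 1.0) for _ in range(n_cases)]
        m = sum(col) / n_cases
        col = [v - m for v in col]
        for prev in cols:  # remove the small correlation with earlier columns
            proj = sum(a * b for a, b in zip(col, prev)) / n_cases
            col = [a - proj * b for a, b in zip(col, prev)]
        scale = math.sqrt(sum(v * v for v in col) / n_cases)
        cols.append([v / scale for v in col])
    return cols


def create_population(means, cov, n_cases, rng):
    """Score vectors with the desired centroid and covariance matrix."""
    t = cholesky(cov)
    f = orthonormal_deviates(n_cases, len(means), rng)
    return [[means[i] + sum(t[i][k] * f[k][c] for k in range(i + 1))
             for i in range(len(means))]
            for c in range(n_cases)]


population = create_population([10.0, 20.0], [[4.0, 3.0], [3.0, 9.0]],
                               50, random.Random(7))
print(population[0])
```

Because the standardized deviates are made exactly uncorrelated before the triangular factor is applied, the created population reproduces the requested centroid and covariance matrix up to rounding error, whatever the number of score vectors (provided it exceeds the number of variables).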


Program Input and Output 


An eight-digit random seed number must be supplied to the
program in the first eight columns of the first input card. On the
second card the number of score vectors and the number of variables
desired are entered in five-column fields. In addition a "1" is punched
if the special case of independent score vectors is desired. This option
allows the program to skip the decomposition of a desired diagonal
covariance matrix. The third card contains the variable format necessary for
reading the input centroids and covariance matrix. If independent
score vectors were chosen, one must then merely enter a vector of
variable variances desired; otherwise, the entire variable covariance
matrix must be entered.

The score vectors desired are punched, and the covariance matrix is
calculated and compared with that input. The resulting difference
matrix of errors is then printed. The errors are never larger than .01%;
however, if more precision is desired, double precision arithmetic
could be easily instituted.


Availability 


The program with complete documentation, including a sample
card setup and output, is available upon request from the author.
CORALL, A FORTRAN IV PROGRAM FOR
CORRELATION MEASURES 


JAN VEGELIUS 
University of Uppsala 


The program can compute a great number of different correlation 
and other statistical measures. The user is free to select among the 
measures and also among the variables that are read by the program. 
When a particular set of variables has been treated in the prescribed 
way, a new set may follow together with new measure definitions. 


In Vegelius (1973) a great number of different correlation measures
are described. Most of them (together with various other elementary
statistical measures) are available in a FORTRAN IV program called
CORALL (Vegelius, 1974), constructed in a way similar to the
SIEGEL program (Vegelius, 1971).


Contents 


This program makes available 20 correlation measures, 3 measures
of central tendency, 12 other elementary measures, 4 instructions for
file treatment and 7 other instructions, e.g., transposition of the data
matrix, comments for the output, dichotomization of the data,
and treatments of missing data. The tetrachoric correlation coefficient
will be computed from the first terms in the power series. The number
of terms may be chosen by the user. Up to 36 terms are possible.
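
To make the power-series remark concrete, here is a minimal sketch of how a tetrachoric coefficient can be obtained from a truncated tetrachoric series. It is an illustration under stated assumptions, not CORALL's routine: the function names are hypothetical, cell a is taken to hold the cases positive on both variables, the series is the standard expansion of the bivariate normal probability in Hermite polynomials, and the root is found by simple bisection, which assumes the truncated series is increasing in r on the interval searched.

```python
import math
from statistics import NormalDist


def hermite_values(x, count):
    """Probabilists' Hermite polynomials He_0(x) .. He_{count-1}(x)."""
    values = [1.0, x]
    for n in range(1, count - 1):
        values.append(x * values[n] - n * values[n - 1])
    return values[:count]


def tetrachoric(a, b, c, d, terms=36):
    """Tetrachoric r for the table [[a, b], [c, d]] from a truncated series."""
    total = a + b + c + d
    p_row = (a + b) / total      # proportion positive on the first variable
    p_col = (a + c) / total      # proportion positive on the second variable
    p_both = a / total           # proportion positive on both
    normal = NormalDist()
    h = normal.inv_cdf(1.0 - p_row)   # threshold for the first variable
    k = normal.inv_cdf(1.0 - p_col)   # threshold for the second variable
    # The series in r must equal this standardized excess of joint agreement.
    target = (p_both - p_row * p_col) / (normal.pdf(h) * normal.pdf(k))
    he_h = hermite_values(h, terms)
    he_k = hermite_values(k, terms)

    def series(r):
        return sum(r ** (n + 1) / math.factorial(n + 1) * he_h[n] * he_k[n]
                   for n in range(terms))

    low, high = -1.0, 1.0
    for _ in range(60):               # bisection on the truncated series
        middle = (low + high) / 2.0
        if series(middle) < target:
            low = middle
        else:
            high = middle
    return (low + high) / 2.0
```

With median splits on both variables the series reduces to the arcsine series, so a table with proportions 1/3, 1/6, 1/6, 1/3 should give a coefficient close to .50. Convergence is slow when the true correlation is near 1 in absolute value, which is presumably why the number of terms is left to the user.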


Input 


The values are normally to be read variable-wise, but it is also possi-
ble to read person-wise and to transpose the data-matrix. Both
variable-format input and format-free input can be used. If one card
(containing 80 columns) is not enough when punching the format, up
to 4 extra cards may be used. When the data are read from cards or
from tape, the measure instructions may follow. It is then possible to
select any of the samples and any of the measures. After one measure
has been chosen, a new one may follow. There is no upper limit to the
number of measure instructions following the data input.


Example: If the data have been read, and the means, the standard 
deviations, the indexes of skewnesses, the indexes of kur- 
tosis of all the variables, and the product moment correla- 
tion coefficients between each variable pair are desired, the 
card order may be (each card beginning in the first column): 
MEAN 
STDEV 
SKEWNESS 
KURTOSIS 
PEARSON 
* 


Restrictions 


The program has been written in FORTRAN IV for use on an IBM 
360/370. As all measures are available in the memory at the same time, 
the program works rather fast. This kind of availability means, 
however, that it is not possible to treat a great amount of data at the 
same time. In the three existing versions, the upper limit to the number 
of data is 2000 (156 K), 9000 (208 K), and 25,000 (416 K). 


Availability 


A description of the program may be obtained from Jan Vegelius,
Department of Psychology, Svartbäcksgatan 10, S 753 20 Uppsala,
SWEDEN.
SIEGEL, A FORTRAN IV PROGRAM FOR
NONPARAMETRICAL METHODS 


JAN VEGELIUS 
University of Uppsala 


The program can perform any one of those nonparametric 
statistical methods that is mentioned in Sidney Siegel’s classical work 
on the subject. The user is free to select among the methods and also 
among the samples that are read by the program. When a particular 
set of samples has been treated in the prescribed way, a new set may 
follow together with new method definitions. 


In Siegel (1956), a large variety of nonparametric statistical methods
is described. All of them (together with a method for multivariate in-
formation analysis, described by Attneave (1959)) are available in a
FORTRAN IV program called SIEGEL (Vegelius, 1971, 1974a,
1974b).


Contents 


This program makes available 4 one-sample methods, 16 two-
sample methods, and 8 methods for more than two samples. For file
treatments, there exist 4 instructions. Finally, six extra instructions are
available, e.g., transposition of the data matrix (if it is square), com-
ments to be written in the output, and treatments of missing data.


Input 


The values are normally to be read sample-wise, and it is possible to
have different sizes of the samples. Both variable-format input and
format-free input can be used. If one card (containing 80 columns) is
not enough when punching the format, up to 4 extra cards may be
used. When the data are read from cards or from tape, the method in-
structions may follow. It is then possible to select any of the samples
and any of the methods. After one method has been employed, a new
one may follow. There is no upper limit to the number of method in-
structions following the data input.

Example: If two independent samples have been read and Mann-
Whitney's U-test, Wald-Wolfowitz' run-test, and Moses'
test of extreme reactions are desired, the card order may
be (each card beginning in the first column):

UTEST 
WALD 
MOSES 
* 


Output 


For most of the methods it is possible to vary the amount of output. 
If more than 2 samples have been selected for the two-sample tests, the 
output will be the elements below the main diagonal in a square 
matrix, where the elements have been obtained by a comparison 
between all possible sample pairs. The values used are probabilities or
correlation coefficients. For further details, see Vegelius (1971).


Restrictions 


The program has been written in FORTRAN IV for use on an IBM 
360/370. As all methods are available in the memory at the same time, 
the program works rather fast. This kind of availability means,
however, that it is not possible to treat a great amount of data at the
same time. In the two existing versions, the upper limit for the number 
of data is 2000 (156 K), and 6000 (208 K), respectively. 


Availability 


A description of the program may be obtained from Jan Vegelius,
Department of Psychology, Svartbäcksgatan 10, S 753 20 Uppsala,
SWEDEN.
A COMPUTER PROGRAM FOR FISHER'S
EXACT PROBABILITY TEST 


D. L. TRITCHLER AND D. T. PEDRINI 
University of Nebraska at Omaha 


The Fisher test is concerned with data of independent groups and 
a dichotomous criterion in a fourfold (2 X 2) contingency format. 
Typically, small samples (N < 30) are considered. Some tables of
critical values of Fisher's test for small samples are available to offset
the tediousness of computation. Fisher's test requires determination
of a combined probability, that is, the observed set added with all the
more extreme (directional) sets. For large samples, an approximation
method such as chi square is used. Considered in this paper are a
mathematical discussion and algorithm, and a computer program for
Fisher's test. The computer program for small and large (N of about 
500) samples is available from the authors. Such technology, in most 
instances, makes approximation methods unnecessary. 


[...] significance of a difference between two independent samples or groups for
which the scores fall into two mutually exclusive classes. This problem 
is commonly depicted in the form of a fourfold or a 2 X 2 contingency 
table. As a technique for analyzing such data, Fisher's exact test 
provides a precise probability. Since this test becomes impractical for 
hand computation in the case of all but the smallest samples, usually 
an approximation method such as chi square is chosen. Considered in 
this paper are a mathematical discussion and algorithm, and a com- 
puter program for Fisher's test. 


Mathematical Discussion 


If the marginal totals of a given 2 X 2 contingency table are con-
sidered fixed, the exact probability of a given set of frequencies, given
the null hypothesis of no difference in population proportions, is given
by the hypergeometric distribution. In Fisher's test the null hypothesis
requires a determination of the combined probability of the observed
set of frequencies and all sets of frequencies more extreme than the
observed (for example, see Blalock, 1972, pp. 287-291). Tritchler and
Pedrini (1974) developed a mathematical algorithm which is com-
putationally efficient and accurate.
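
As a rough illustration of the combined probability just described, the following sketch sums hypergeometric probabilities over the observed table and all tables more extreme in the observed direction. It is not the Tritchler and Pedrini algorithm: the function name is hypothetical, and Python's exact integer arithmetic (`math.comb`) is used in place of the scaled factorial computations a FORTRAN program would need.

```python
from math import comb


def fisher_one_tailed(a, b, c, d):
    """One-tailed Fisher exact probability for the table [[a, b], [c, d]].

    a, b are the class frequencies for one group; c, d are those for the
    other group. Marginal totals are treated as fixed.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2

    def probability(x):
        # Hypergeometric probability that the first cell equals x.
        return comb(row1, x) * comb(row2, col1 - x) / comb(total, col1)

    smallest = max(0, col1 - row2)   # smallest first cell the margins allow
    largest = min(row1, col1)        # largest first cell the margins allow
    if a * d >= b * c:
        tail = range(a, largest + 1)     # observed cell and larger values
    else:
        tail = range(smallest, a + 1)    # observed cell and smaller values
    return sum(probability(x) for x in tail)
```

For the table 3, 1, 1, 3 the tail contains two tables (first cell 3 or 4), and the function returns 17/70, or about .243.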


The Computer Program 


Tritchler and Pedrini (1975) developed a FORTRAN computer 
program to implement the algorithm referred to above. The program 
is appropriate for sample sizes ranging from multiple zero cells to an N 
of about 500 (as analyzed on an IBM 360/65). 

Of course, for small sample sizes, desk calculations may suffice.
Some nonparametric texts (for example, Siegel, 1956, pp. 256-270) in- 
clude tables of critical values of Fisher's test for small samples. But for 
large sample sizes and for exact probabilities, a computer program 
seems mandatory. 


Input 


The data for a given problem are entered on the data card. Let a, b
be the class frequencies for one group and let c, d be the respective
class frequencies for the other group. Then the integers are entered on
the card in the order a, b, c, d, separated by blanks. The input is free
format, that is, the integers may be anywhere on the card.


Output


Output consists of the contingency table with marginal totals, and
Fisher's exact probability for the one-tailed case. A sample of the
contingency table output is shown below:


[Sample output: a 2 X 2 contingency table of GROUP by OUTCOME (+, -)
with marginal totals; the cell entries are not legible in the source.]
P (OCCURRENCE OF A DIFFERENCE THIS EXTREME OR GREATER) = 0.0002
NOTE: THIS IS A ONE-TAILED TEST


Capabilities and Limitations


The algorithm used by the program is quite accurate, with a maximum
relative error of $(6a + 2c - 2)/10^{M-1}$ (where a and c are the fre-
quencies in the observed cells A and C, and M is the number of signifi-
cant digits of a double precision number for the machine being used).
Because of the large values involved in computing factorials, values 
are repeatedly scaled to prevent exponent overflow. If the procedure 
breaks down (contingent upon the sample size, the distribution among 
cells, and the computer used), the fact is noted in the output (Tritchler 
and Pedrini, 1975). Robertson (1960) and Gregory (1973) also have
discussed computer programs for Fisher's test. Robertson
has not mentioned maximum sample size limitations; Gregory has
defined an upper limit of N = 100 for his program.


Availability 


A copy of this article and a program listing can be obtained from D.
L. Tritchler, Computer Center, or D. T. Pedrini, Psychology Depart-
ment, University of Nebraska at Omaha, Omaha, Nebraska 68132.
A COMPUTER PROGRAM TO DETERMINE RELATIONS
AMONG GENUINE DICHOTOMIES:
THE PHI AND G STATISTICS


HOWARD CHAMBERLAIN AND DAVID D. VAN FLEET
Texas A&M University 


The Phi and G statistics for dichotomous variables are discussed
and a Fortran program to compute them is described. Input is to be 
in card form, output may be printed, punched, or placed on magnetic 
tape. The punch or tape output is designed to be used as input for the 
BMD X72 factor analysis program. 


In some types of research the responses of the subject are in the
form of true dichotomies, such as yes-no, true-false, or existence or
absence of a given condition. Two statistics suggested for this situa-
tion are Phi and G.

Phi is the product moment coefficient between pairs of genuinely
dichotomous variables. Except under certain conditions, it is not
symmetrically distributed about zero and does not range from -1.0 to
+1.0 (Guilford, 1965). Additionally, Phi may have a sign indicating a
clearly nonlogical direction of relation (Holley and Guilford, 1964). G
is based upon the probability of agreement of response, is symmetrically
distributed, and ranges from -1.0 to +1.0 (Holley and Sjoberg, 1968).
While Phi is sensitive to skewness of the marginal distributions, G is
not (Holley and Eriksson, 1970).
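
The two coefficients can be illustrated for a single pair of variables as follows. This is a minimal sketch, not the authors' Fortran program: the function name is hypothetical, responses are assumed to be coded 0 and 1, Phi is computed from the usual fourfold formula, and G is taken as the proportion of agreements minus the proportion of disagreements.

```python
from math import sqrt


def phi_and_g(x, y):
    """Return (Phi, G) for two equally long lists of 0/1 responses."""
    pairs = list(zip(x, y))
    a = sum(1 for u, v in pairs if u == 1 and v == 1)   # both one
    b = sum(1 for u, v in pairs if u == 1 and v == 0)   # disagreement
    c = sum(1 for u, v in pairs if u == 0 and v == 1)   # disagreement
    d = sum(1 for u, v in pairs if u == 0 and v == 0)   # both zero
    n = a + b + c + d
    # Phi: product moment correlation in fourfold form (undefined when a
    # marginal total is zero, which this sketch does not guard against).
    phi = (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # G: agreements minus disagreements, as a proportion of all cases.
    g = (a + d - b - c) / n
    return phi, g
```

With evenly split marginals the two coefficients coincide; they diverge as the marginal distributions become skewed, which is the sensitivity of Phi noted above.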

The purpose of this paper is to present a computer program that will
calculate the Phi and G coefficients. It is designed to take up to 400
observations and 150 variables or questions. However, the user may
adjust the program to fit his requirements.

Input: The input must be in card form with responses punched 
as zeros and ones or as blanks and ones. The user con- 
trols the form of the input through a format card. 


Copyright © 1975 by Frederic Kuder 


722 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Output: The output will be a table of values in the form required
as input to the BMD X72 factor analysis program.
Through use of a control card any combination of
printer, punch, or tape output may be selected. The con-
trol card also allows the user to suppress either the Phi
or G table.


This Fortran program is available from Dr. Howard Chamberlain, 
Department of Management, Texas A&M University, College Sta- 
tion, Texas 77843. 
EDUCI LIBRARY: A DESCRIPTION OF
FORTRAN IV COMPUTER PROGRAMS FOR 
THE IBM SYSTEMS 3/10 


RICHARD B. BALDAUF, JR. 
Department of Education, American Samoa²


A library of 20 FORTRAN computer programs has been com- 
piled, modified, and edited to provide in a single source a series of 
test scoring, data reduction, and evaluation programs for educators 
having access to small business-oriented computers. Summary 
details are provided for each program. 


A growing number of school systems and users who have access to
small core business oriented computers with limited FORTRAN com-
pilers are finding that there is no single source of suitable programs to
solve the educational problems of test scoring, data reduction, and
program evaluation. The EDUCI library, which was written to meet
these needs, is a selection of twenty computer programs which have
been edited and rewritten to conform to the FORTRAN compiler sup-
ported by the IBM Systems 3/10 and to have a maximum size of 32K
(International Business Machines, 1972). When reducing program
size, the writer made every effort to retain or add desirable features and
to standardize input procedures so as to make the programs relatively
easy to use for educational program managers and evaluators un-
familiar with computer programming.


Modifications 


Modifications in the programs were of two types. First, programs
were altered to conform to the Systems 3/10 FORTRAN compiler.


1 The programmer is indebted to Robert V. Bloedon for his initial assistance and sup-
port, without which this project would not have been undertaken, and to Sili Atuatasi,
Assistant Director for Research and Development, and Mere T. Betham, Director of
Education, who allocated the resources necessary for its completion.

2 Now with the Faculty of Education, James Cook University, Australia. 
For example, because variable formats are not supported, the
programs were written with a choice of four standard formats. Many 
subroutines had to be eliminated since the compiler does not allow 
them to be dynamically dimensioned. Others were deleted because 
subroutines can not call other user supplied subprograms, nor contain 
any disk file operations (IBM, 1972). Second, program size was 
reduced by limiting the number of variables, and subroutines were 
changed or eliminated if they required separate core storage above 
that available in the main program. Programs were generally restricted 
in size to about 28K to avoid overlays which were found to increase 
processing time considerably. Many of the programs require one or 
more storage files to operate. 

These changes have resulted in a series of programs which contain
few overlapping components. Although this simplification is far from
ideal from the programmer's point of view, it has resulted in programs
which can handle relatively large amounts of data with only restricted
core and at a reasonable cost. Users with computers having less than
32K could use many of the programs in the library by reducing the
dimensioned size of the variables.


Descriptions 


Table 1 lists the programs' names, provides a brief description of the
analyses computed, and gives the original program source. The
EDUCI Users Manual (Baldauf, 1974) furnishes a detailed description
of how to set up and to operate each program along with a brief sum-
mary and an example. Users of these programs will probably want to
consult the original sources for each to obtain descriptions of com-
putational details, program functions, and uses.


Characteristics 


Table 2 lists for each program details concerning program size,
capabilities, and limitations. In addition to the programs listed, the
library provides a series of simple utility routines.


Availability 


Copies of the EDUCI users manual, sample test data, program
source listings, and program source decks are available at cost from
the Supervisor of Testing, Department of Education, Pago Pago,
American Samoa.
FORTRAN IV PROGRAM TO DETERMINE THE
PROPER SEQUENCE OF RECORDS IN A DATAFILE¹


MICHAEL P. JONES AND ROLAND K. YOSHIDA²


Neuropsychiatric Institute-Pacific State Research Group 
Pomona, California 


This FORTRAN IV program executes an essential editing 
procedure which determines whether a datafile contains an equal 
number of records (cards) per case which are also in the intended se- 
quential order. The program which requires very little background in 
computer programming is designed primarily for the user of 
packaged statistical procedures. 


In the preparation of datafiles for analysis, various types of errors
arise which violate the requirement of a sequentially ordered file with
an equal number of records (cards) per case. The purpose of this paper
is to describe a program that is designed to identify two common diffi-
culties: (a) the existence of too few or too many records for a case and
(b) the improper sequencing of cards. This program is especially useful
for users who have relatively little background in computer
programming and who rely primarily upon packaged statistical programs
such as SPSS (Nie, Hull, Jenkins, Steinbrenner, and Bent, 1975) and
BMD (Dixon, 1973).


¹ This report was funded in part by a U.S. Office of Education, Bureau of the Han-
dicapped, Grant OEG 0-73-5263. The opinions expressed herein do not necessarily
reflect the position or policy of the U.S. Office of Education, and no official endorsement
by the U.S. Office of Education should be inferred.

² Also at the University of Southern California.


Input Information


There is no limit on the number of records in the datafile of interest;
the source may be from card image, disk, or tape. Case and card
numbers may be located in any column of a record, and two inclusive
ranges of card numbers may be searched such as 1-9 and 15-23.
The routine, which is written in FORTRAN IV, requires these con-
trol cards:
1. PARAMETER CARD: (mandatory) Format (2A4, A2, I2, 1X,
   4(I2, 1X))
   Col 1-9 Code PARAMETER
   Col 11-12 (mandatory) Identifies the input source. Any unit
   number (right justified) from 1-99 except 6 and 7 is valid
   (5 = card reader only).
   Col 14-15 (mandatory) Identifies the beginning card number of
   the first sequence (right justified).
   Col 17-18 (mandatory) Identifies the ending card number of first
   sequence (right justified).
   Col 20-21 (optional) Identifies the beginning card number of the
   second sequence (right justified).
   Col 23-24 (optional) Identifies the ending card number of second
   sequence (right justified).
2. FORMAT CARD: (mandatory) Format (2A3, 18A4)
   Col 1-6 Code FORMAT
   Col 7-80 Code FORTRAN input format for case number and
   card number.


The data cards follow the format card. If the input source is disk or
tape, insert the proper FT statement corresponding to the unit coded
on the PARAMETER card.


Limitations 

The following limitations apply to the card order program:

1. Case IDs, which must be numerical, may not exceed 9 characters
in length.

2. Card numbers must range between 1 and 99.

3. Case numbers are not searched for sequence, duplication, or in-
clusion in the datafile.


Output 


For each datafile, the output specifies the card numbers of each case
which violates the sequence and the range of card number values
provided on the PARAMETER card and gives the total number of cards for
the datafile of interest.
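
The kind of check described above can be sketched as follows. This is an illustration only, not the authors' routine: the function name and the in-memory record layout (a list of case number and card number pairs, already read in file order) are assumptions, and, like the program, the sketch does not examine the case numbers themselves.

```python
from itertools import groupby


def check_card_order(records, ranges):
    """Report cases whose cards depart from the expected sequence.

    records: (case_id, card_number) pairs in the order they occur in the file.
    ranges:  inclusive (first, last) card-number ranges, e.g. [(1, 9), (15, 23)].
    Returns (violations, total_cards), where violations lists each offending
    case together with the card numbers actually found for it.
    """
    expected = [n for first, last in ranges for n in range(first, last + 1)]
    violations = []
    total_cards = 0
    # Consecutive records with the same case number form one case.
    for case_id, group in groupby(records, key=lambda record: record[0]):
        cards = [card for _, card in group]
        total_cards += len(cards)
        if cards != expected:          # too few, too many, or out of order
            violations.append((case_id, cards))
    return violations, total_cards
```

For example, with one range of 1-3, a case whose cards arrive as 1, 3, 2 and a case with only cards 1 and 2 would both be reported, while a case with cards 1, 2, 3 would pass.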


Availability of Program 


A listing and write-up of the card order program along with sample
input and output can be obtained by writing to Roland K. Yoshida,
Neuropsychiatric Institute-Pacific State Research Group, P. O. Box
100-R, Pomona, California 91766.
A FORTRAN PROGRAM FOR ANALYZING THE
RESULTS OF FLANDERS' INTERACTION MATRIX:
AN UPDATED VERSION


VINCENT RACIOPPO AND PIETRO J. PASCALE
Youngstown State University 


GAVIN DOUGHTY, JR. 
Tarkio College 


This paper presents a revised and updated version of a 
FORTRAN program which computes all indices used in the 
Flanders’ Interaction Matrix. The new program has added another 
form of data input which simplifies data entry. The new version also 
has the capability of interactive terminal use. 


Racioppo and Pascale (1974) developed a FORTRAN program for
analyzing the outcomes of Flanders' Interaction Matrix. The original
program has been updated and improved. Several improvements such
as the capability for use of the program on keyboard terminals have
been developed. The revision feature of this program which supports
keyboard terminal use is the new manner of data entry. Data entry for
the original program required data to be in matrix form. In other
words, the user had to hand tally the data into a ten by ten matrix.


Purpose 


The purpose of this paper is to make known a new program that
provides for data entry in the form of Flanders' category numbers in
direct correspondence to the observation. The program in effect builds
the ten by ten matrix.
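
How a matrix can be built from the recorded category numbers is sketched below. This is a minimal illustration, not the authors' FORTRAN program: the function name is hypothetical, and it assumes the usual Flanders tallying convention in which each pair of consecutive codes adds one tally to the cell whose row is the first code and whose column is the second.

```python
def build_interaction_matrix(codes, n_categories=10):
    """Tally a sequence of category numbers (1..n_categories) into a matrix."""
    for code in codes:
        if not 1 <= code <= n_categories:
            raise ValueError(f"category number out of range: {code}")
    matrix = [[0] * n_categories for _ in range(n_categories)]
    # Each overlapping pair (previous, next) contributes one tally.
    for previous, following in zip(codes, codes[1:]):
        matrix[previous - 1][following - 1] += 1
    return matrix
```

A sequence of k codes yields k - 1 tallies, so the matrix total offers a quick check on data entry.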


Program Characteristics 


What the observer records is the input, which opens the possibility
of putting a portable keyboard terminal into a classroom and
recording the categories directly onto the keyboard. There is virtually in-
stantaneous evaluation of both the configuration of the ten by ten
matrix and the various indices. The revised program also defines on
comment cards all the computed indices.


Availability 


Those who have requested the initial program will receive the up- 
dated version. Copies of the manuscript, a complete documentation, a 
listing, and the program punched on cards can be obtained from Dr. 
Pietro J. Pascale, Youngstown State University, Foundations of 
Education, Youngstown, Ohio 44555. 
A COMPUTER PROGRAM FOR CALCULATING AN
INDEX OF INTEROBSERVER RELIABILITY 
FROM TIMESERIES DATA 


BILLY W. THORNTON 
Boys Town, Nebraska 


FRANK L. CROSKEY 
Shawnee Mission School District 


The purpose of this paper is to describe a method of indexing in- 
terobserver reliability from data relating to mutually exclusive 
nominal categories. The computer program to be described com- 
putes estimates of reliability from nominal data collected in a time 
interval manner. A specified critical time interval, the order of the 
responses, and the number of responses are important factors in 
computing the amount of agreement between judges. 


The purpose of this paper is to describe a method of indexing in-
terobserver reliability from data relating to mutually exclusive
nominal categories. The computer program to be described computes
estimates of reliability from nominal data collected in a time interval 
manner. A specified critical time interval, the order of the responses, 
and the number of responses are important factors in computing the 
amount of agreement between judges. 


Background Information 


The computational procedures used in this program parallel steps
outlined by Scott (1955). The formula for Scott's Coefficient of
Interobserver Reliability is:


$$\pi = \frac{P_o - P_e}{1 - P_e}$$


$P_o$ (observed percentage of agreement) represents the percentage of
judgments on which two observers agree when coding the same data
independently; $P_e$ is the percentage of agreement to be expected by
chance. Thus, $\pi$ is the ratio of the actual difference between obtained
and chance agreement to the maximum possible difference between
obtained and chance agreement. Procedures for the hand calculation of
$\pi$ may be found in Flanders (1967, pp. 158-166). Ordinarily, the value
of $P_o$ is computed by considering number of agreements across
categories for the complete set of data for two judges. This program
computes $P_o$ for two judges by summing the number of agreements
within each time interval across time. Thus, $P_o$ for judge $j_1$ and
judge $j_2$ is expressed as follows:


$$P_o = \frac{\text{number of agreements}}{\text{number of total responses}}$$

The number of agreements for two judges is defined to be the sum over
time intervals and categories of the minimum frequency of common
observations per cell. For example, in a given time unit and for a
specific category, if judge ($j_1$) indicated three observations of the event
and judge ($j_2$) indicated four observations, then the two judges agree
three times and disagree one time. It is assumed that if two judges in-
dicate an event occurred, then there is agreement. However, if one
judge indicates an event occurred and the other did not, then there is a
disagreement. Therefore, the number of disagreements for two judges
is the sum of the absolute value of the differences between the number
of responses in corresponding cells. The total number of responses is
the sum of the number of agreements and number of disagreements.
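
The counting rules above can be illustrated for one pair of judges as follows. This is a sketch under stated assumptions, not the authors' program: the function name and the data layout (one list of category counts per time interval for each judge) are hypothetical, and since the text does not say how the program obtains the chance agreement, it is computed here in Scott's usual way, as the sum of squared category proportions pooled over both judges.

```python
def interval_scott_pi(judge1, judge2):
    """Agreements, disagreements, and Scott's pi for two judges.

    judge1, judge2: lists of time intervals, each interval a list giving the
    number of observations the judge recorded in every category.
    """
    n_categories = len(judge1[0])
    agreements = 0
    disagreements = 0
    category_totals = [0] * n_categories
    for interval1, interval2 in zip(judge1, judge2):
        for category, (x, y) in enumerate(zip(interval1, interval2)):
            agreements += min(x, y)          # common observations in the cell
            disagreements += abs(x - y)      # unmatched observations
            category_totals[category] += x + y
    total_responses = agreements + disagreements
    p_observed = agreements / total_responses
    # Assumption: chance agreement from pooled category proportions.
    grand_total = sum(category_totals)
    p_chance = sum((t / grand_total) ** 2 for t in category_totals)
    pi = (p_observed - p_chance) / (1 - p_chance)
    return agreements, disagreements, pi
```

Summing the counts of adjacent intervals before calling the function corresponds to the option, described in the next section, of combining consecutive intervals and recomputing the coefficients.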




General Description of the Computer Program 


The program is an interactive time-sharing FORTRAN program.
The program asks a series of questions, the answers to which specify
the number of judges, the time interval, the title of the run, and the
name of the input data file. The input for the program consists of a
three dimensional (judges, time, categories) data matrix, the entries of
which are the number of observations for each cell.

Output consists of the number of agreements and disagreements
for each pair of judges and interobserver reliability coefficients
for each pair of judges. The coefficients are output in matrix form. The
computer program allows the user the option of combining sets of
consecutive intervals for which new coefficients are computed. The
process can be repeated as desired. If the interval length equals the
total time segment, the reliability coefficients between judges will at-
tain maximum values. Thus, by comparing the magnitude of reliability
coefficients from small intervals with reliability coefficients from larger
intervals, characteristics of the data are revealed. In interpreting the
reliability coefficient, the user should keep in mind the possibility of a
time lag between judges.


Program Availability 


The computer program, which is written in time-sharing
FORTRAN, operates on a Honeywell 635. A listing of the program is
available on request from B. W. Thornton.
BOOK REVIEWS


C. Mauritz Lindvall and Anthony J. Nitko. Measuring Pupil Achieve- 
ment and Aptitude. (2nd. ed.) New York: Harcourt Brace 
Jovanovich, 1975. Pp. xi + 237. $3.95 (paperback) 


The first chapter of this book specifies the areas of pupil evaluation: 
(1) achievement, (2) aptitude, (3) interest, and (4) personality. The 
procedures of assessing these areas are then briefly outlined. The 
chapter concludes with several references relevant to the historical 
development of testing and evaluation and to more extended discus- 
sion of interest inventories and personality tests. 

Chapter 2 is concerned with the planning of instruction and of
evaluation. The emphasis is on instructional objectives, their organiza-
tion and sequencing. There is brief, but adequate, discussion of such
characteristics of tests as validity, reliability, objectivity, and com-
prehensiveness, but no mention of machine scoring or analysis. In
Chapter 3 different types of tests are related to different objectives as 
defined in Taxonomy of Educational Objectives by Benjamin Bloom 
and others (including this reviewer). Chapter 4 explains the construc- 
tion of teacher-made tests. It is most adequate with reference to rules 
for writing different types of objective items, but least adequate with 
reference to the production of thought-provoking exercises. 

Chapter 5 is devoted to the interpretation of test scores. It includes
brief discussion of the difference between criterion-referenced and
norm-referenced testing and brief, but adequate, explanations of
percentile rank, mean, standard deviation, standard scores, normal
distributions, and stanine scores. Table 5.4 presents excellent com-
parison of various kinds of norm-referenced scores. In Chapter 6 there
is elementary explanation of coefficients of correlation and of
the kinds of test validity and of the means of assessing test reliability.
Chapter 7 explains the process of constructing and standardizing a
norm-referenced achievement test and describes several widely used
achievement test batteries including the California Achievement Tests,
the Iowa Tests of Educational Development, the Metropolitan Achieve-
ment Tests and the Stanford Achievement Tests. Similarly Chapter 8 is
concerned with scholastic aptitude: the Stanford-Binet and such
group tests as the Otis-Lennon.

Chapter 9 deals with testing and evaluation in the individualizing of
instruction while Chapter 10 describes the planning and implementing
of a comprehensive evaluation program. Some attention is given to the
use of rating scales.

With some supplementation, this text would be an excellent basis of 
instruction for an introductory course in educational and psy- 
chological measurement. 




Max D. ENGELHART 


Jerome M. Sattler. Assessment of Children’s Intelligence. (Rev. reprint) 
Philadelphia: W. B. Saunders, 1974. Pp. xxii + 591. $14.95. 


The author lists three goals of this book: (1) to assist students with 
the process of psychological evaluation; (2) to bring out the findings 
and insights of many pioneer clinicians, educators, and investigators; 
(3) to summarize and integrate the findings of studies concerned with 
individual intelligence tests and variables in the testing situation. The 
major assumption underlying these goals is that individually- 
administered intelligence tests can yield much more than just an IQ. 
Furthermore, it is maintained that group-administered tests are less 
sensitive than individual tests to the cognitive and conative features of 
personality. 

This large book possesses features of both a textbook and a source 
book, and this reviewer readily recommends it for graduate courses on 
intelligence testing. The book is well-written and clear, but the reader 
will need some knowledge of tests and measurements, developmental 
psychology, and abnormal psychology in order to derive the greatest 
benefit. An attractive feature for instructors is a manual of multiple
choice questions. These questions, however, should be viewed as only 
a supplement to a more thorough evaluation of the students from 
observations of their performance in test situations and the quality of
case reports. 

The 27 chapters comprising the book are grouped into six sections: 
1. Introduction and General Considerations; 2. Administering In- 
dividual Intelligence Tests; 3. Stanford-Binet Intelligence Scale; 4. 
WISC, WPPSI, and Other Tests; 5. Diagnostic Applications; 6. 
Psychological Reports and Consultation. The core of the book con- 
sists of detailed descriptions of the development, administration, and 
interpretation of the Stanford-Binet, WISC, and WPPSI. These 
chapters (8-17) by themselves constitute almost an entire course on in- 
telligence testing of children. The reviewer found, however, that some 
of the most informative and useful material appears in Sections 2, 5, 
and 6. Admittedly, Section 5 (Diagnostic Applications) is a bit disap-
pointing if one expects it to present clear-cut methods of diagnosing
and prescribing for various disorders and exceptionalities. Such 
straightforward methods, of course, do not exist, and Jerome Sattler is 
no Pollyanna. Rather he is an empiricist and compiler who recognizes 
the limitations of intelligence tests for diagnostic and prescriptive pur-
poses. Consequently, hundreds of empirical investigations are cited in
this book, but students and practicing psychologists who entertain un- 
realistic expectations about the diagnostic abilities of individual intel- 
ligence tests may “come out by that same door wherein (they) went." 

In short, the twenty-seven chapters make up a comprehensive hand-
book on intelligence testing of children. Complementing these
chapters are six appendixes: A. Supplementary WISC Scoring 
Criteria; B. List of Validity and Reliability Studies for the Stanford- 
Binet, WISC, WPPSI, PPVT, Quick Test, Leiter, and Slosson; C. 
Miscellaneous Tables; D. Stanford-Binet Intelligence Scale, Form L- 
M, 1972 Norms; E. Wechsler Intelligence Scale for Children-Revised 
(WISC-R); F. WISC-R Tables. The last three appendixes were added 
in the revision to accommodate the 1972 Stanford-Binet norms and 
the WISC-R. Anyone planning to use the WISC-R would be well ad- 
vised to examine Appendixes E and F of this book before proceeding. 

Unfortunately, not all individual intelligence testers possess the in- 
terpersonal sensitivity and statistical-technical expertise needed to 
draw sound diagnostic conclusions from test responses. To be sure, 
Jensen and Shockley have had some negative effects on the public im-
age of intelligence testing. But this reviewer is convinced that
vociferous critics of intelligence testing find ample ammunition in the 
errors of poorly-trained, and perhaps poorly-endowed, psy- 
chodiagnosticians. Many of these itinerant Binet-testers and WISC- 
testers might benefit from a serious study of Sattler’s book. He has 
performed a useful scholarly service in collecting under one cover a 
great deal of material concerned with intelligence testing of children. 


Lewis R. AIKEN, JR. 


Warren W. Willingham. College Placement and Exemption. New
York: College Entrance Examination Board, 1974. Pp. xv + 272. 
$6.95 and $4.95 (paperback). 


Bringing order out of chaos, or at least out of great diversity, is no
easy task, but this was one of the problems confronting the author of
this text. American education is known for its diversity and these
differences are clearly illustrated in the wide range of prevalent policies
and practices in college placement and exemption. As stated by the
author, the primary purposes of this text are:
1. to develop a framework that would include the most important
   types of placement and exemption and closely related models and
   to help clarify the relationship among them.
2. to describe the educational rationale and technical characteristics
   of these models.
3. to review fairly thoroughly the relevant research literature.
In accomplishing these purposes the primary aim “was to en-
courage on individual campuses more systemic analysis of the objec-
tives and outcomes of these various models of sorting students into
alternate educational treatments.” Even though, as expressed by the
author, his intent was not to produce a “how to” handbook, this text
is intended for responsible practitioners such as college administrators 
and faculty as well as researchers and directors of testing. Related im- 
portant topics intentionally not covered by this text include various 
aspects of implementing placement and exemption programs, 
academic advising, and providing students with necessary informa- 
tion. Efforts were made in preparing this text to avoid the technical ap- 
proach wherever possible. 

This text consists of seven chapters. The first chapter deals with the 
statement and nature of the problem, the second provides a somewhat 
technical rationale for the models to be discussed, and chapters 3
through 6 present, respectively, four classes of alternate treatments, as- 
signment, placement, selection and exemption, and twelve derived 
models under these classes. Chapter 7 presents the conclusions and im- 
plications. An annotated bibliography containing approximately 80 
sources is a valuable asset of this text. 

Overall, the author of this text has done an admirable job in at- 
tempting to take widely diverse educational practices and programs 
and structuring them into a rational framework that could aid prac-
titioners. There is no doubt that this type of endeavor has been long
needed, especially in view of the increased access to higher education 
over approximately the last decade. As discussed in Chapter 1, this 
almost unlimited free access has greatly increased the heterogeneity of 
students attending higher education institutions, and for various 
reasons institutions have developed a keen desire to respond to the in-
dividual needs and interests of students. Therefore, accommodating
education to individual differences has received and will continue to 
receive greater emphasis. A second stated purpose of this text was to 
describe the educational rationale and technical characteristics of 
placement, exemption, and closely related models and to help clarify 
the relationship among them. This is not only a most ambitious pur- 
pose, but almost an impossible one to attain considering the complex- 
ity of measurement and evaluation and the variance between measure- 
ment theory and practice. In Chapter 2 the author states that “decision
theory does suggest a general framework for considering problems of
alternate educational treatments and it does focus attention on the
[illegible].” Yet, as pointed out by the author, there are [illegible]
restrictions to decision theory. In discussing one of the models of
“assignment” in Chapter 3, i.e., “method variation,” the
primary purpose of which is to consider ways of identifying the
[illegible] treatment interactions that can make it possible to adapt instruc-
tion to individual differences, the author mentions the
[illegible] current research and states that this is still a research
[illegible], not an educational strategy. It is this type of problem that
diminishes to some extent the usefulness of some of the models
presented in this text, for example, in matching students with teachers
with certain characteristics (model 2). Despite this important limita-
tion, as well as lack of adequate consideration of relevant significant
psychometric problems, the models presented in Chapters 3 through 6 
represent an outstanding attempt to bring some rational structure out 
of extreme diversity. Chapter 7, the conclusions, is an excellent discus- 
sion of the implications of these models in view of our educational set- 
ting and some of the current problems such as articulation. 

This text should be of immense value, not only to developing institu- 
tions, but to well-established institutions with placement and exemp- 
tion programs. Its primary value lies not in how to develop and imple- 
ment such programs, which is very much needed but beyond the scope 
of this book, but in giving institutions a framework upon which to 
proceed in developing such programs or a framework upon which to 
evaluate their existing programs. The author has, indeed, ac- 
complished another purpose by reviewing extensively the relevant 
research literature, and even though much of the research presented 
cannot be considered as significant research, it nevertheless can be of 
help to individuals in getting a feel for the “lay of the land." Fragmen- 
tation of placement and exemption programs within institutions is a 
common symptom; this text can be profitable in the improvement of
such programs with the end result of better meeting the needs of stu-
dents.


HENRY MOUGHAMIAN 
City Colleges of Chicago


GROUP SIZE EFFECTS IN EMPLOYMENT TESTING¹


JOSEPH M. HILLERY AND STEPHEN S. FUGITA
University of Akron 


Effects of the number of individuals (1 to 10) coacting while taking 
two standardized motor performance tests were examined. Scores on 
the manual and finger dexterity sections of the General Aptitude Test 
Battery were collected from two state employment agencies for 2,261 
actual applicants. Increases in aptitude scores corresponding to in- 
creases in group size were predicted based upon the summation 
hypothesis of social facilitation theory. Results indicated a group size 
effect with performance appearing to increase somewhat linearly 
with increases in number of coactors. The implications for social
facilitation theory and the interpretation of tests administered in a
group setting were discussed.


SINCE the publication of Zajonc's (1965) paper drawing on Hull-
Spence Theory (e.g., Spence, 1956) to reconcile the conflicting findings
in the social facilitation literature, a considerable amount of research
has been conducted. Much of the published research supports Za-
jonc's hypothesis (cf. Zajonc, 1972). Zajonc proposed that the
presence of others increases general arousal or generalized drive, and
this, according to the multiplicative drive law, enhances the
probability that dominant responses will be emitted. If the dominant
responses are correct, as in simple, well-learned or instinctual
behaviors, performance will be improved. If, on the other hand, the
dominant responses are incorrect as is likely if the behavior is poorly
learned or complex, performance will be impaired since the
probability of incorrect responses being emitted will be greater.
An implication of an extended version of this theory is that as the


¹ The authors would like to acknowledge the generous assistance of Geoffrey Johnson
of the Michigan Employment Security Commission and Frank Lewandoski and Elwood
Ziegler of the Ohio Bureau of Employment Services for their help in data collection.
Also helpful were Richard H. Haude's, Paul C. Rosenblatt's, and Kenneth M. Wexley's
comments on an earlier draft.
Copyright © 1975 by Frederic Kuder

number of others present increases, arousal may concomitantly in-
crease and hence, the probability that dominant responses will be emit-
ted may be similarly increased. This summation hypothesis, in the 
audience paradigm, has been discussed by Weiss and Miller (1971). 
Although summation is a seemingly simple way to manipulate drive, 
few social facilitation studies have focused on the group size factor 
(Dorrance and Landers, 1974). Brenner (1974a) using the Psycho- 
logical Stress Evaluator (an instrument which monitors voice 
patterns believed to be associated with arousal in the central nervous 
system) reports that arousal increases as a power function of audience 
size in a public speaking situation. His audience group sizes were 0, 2, 
8, and 22 spectators. Dorrance and Landers' experiment also partially
supported the hypothesis that increases in activation are positively
related to increases in audience size. Furthermore, they report an
audience size × task type interaction using two different kinds of
motor performance tasks. 


In an early investigation of the effects of an audience on motor and 
cognitive performance, Gates (1924) reported the possibility of a slight 
"stimulating" effect with a large audience (27 to 37 members) on a 
word naming task. More recently, Burwitz and Newell (1972) report 
that on a novel motor skill where subjects coacted alone, in dyads, or 
in tetrads, there was no difference between the alone and dyad condi- 
tions, but both were significantly superior to tetrads. Martens and 
Landers (1969) examined performance on a simple, well-learned 
muscular endurance task under one of three group size conditions; 
alone, in pairs, or in groups of four. Results showed that coacting in- 
dividuals in tetrads performed significantly better than those alone or 
in dyads. Martens and Landers (1972) also ran an experiment which 
examined performance during the acquisition of a complex motor 
skill. Group size was again manipulated, this time with four condi- 
tions; alone, dyads, triads, and tetrads. Results generally supported 
the hypothesis that increasing the number of coactors results in in- 
creasing impairment of complex motor performance.

The present study was designed to systematically test the group size
effect. Ten group sizes which contained from one to ten coacting in-
dividuals were examined in a highly standardized employment testing
situation. This field situation is a presumably ego-involving one which
has serious consequences for those whose performance may be
affected by any social facilitation effects. Utilizing the employment test
setting also has the advantages of minimizing potential reactive or ex-
perimenter effects which might be associated with laboratory research
(e.g., Brenner, 1974b) and of providing additional information about
the robustness and generalizability of the social facilitation phenom-
enon. The experimental tasks were simple, standardized motor tests.


Method 


Scores on the manual and finger dexterity sections of the General
Aptitude Test Battery (GATB) developed by the United States
Employment Service (Dvorak, 1947) and used extensively by state
employment offices throughout the country were collected. Four speed
tests which measure the two dexterity aptitudes were used. These were
the Place and Turn tests which index manual dexterity and the Assem-
ble and Disassemble tests which index finger dexterity. Both the Place
and Turn tests utilize a pegboard divided into two sections. In the
Place test, the examinee removes two pegs simultaneously, one in each
hand, from holes in the upper part of the board and inserts them into
the corresponding holes in the lower part of the board. In the Turn
test, the examinee removes a peg from a hole, turns the peg over, and
returns it to the hole from which it was taken.

Both Assemble and Disassemble tests use a small rectangular board
containing 50 holes and a supply of small metal rivets and washers. In
the Assemble test, the examinee takes a rivet from a hole with one hand
and a small washer from a vertical rod with the other hand.
The examinee puts a washer on the rivet and inserts the assembled piece into
the corresponding hole in the lower part of the board. In the Disas-
semble test, the examinee removes the metal rivet and washer in the
opposite manner from the Assemble operation.


Subjects and Procedure 


Scores on the manual and finger dexterity section of the GATB were
collected from job applicants to two state employment agencies,
Michigan (N = 1783) and Ohio (N = 428). Arrangements were made
with the [illegible] sections of both state agencies to collect data in a
specified manner. Branch offices in urban areas of Michigan and the
Youngstown region of Ohio cooperated in the data collection.

Experienced test administrators who had been trained by their respec-
tive agencies to administer the GATB in a highly standardized manner ad-
ministered the instrument. Assignment of individuals to group sizes
was essentially random in as much as each branch office had prescribed
times to administer the test. Thus, the instrument was given to that
group of individuals who arrived after the last administration but
before the administration in question. In the Michigan sample, each
test administrator was asked to send 75 test record cards or those given 
in a three month time period. In the Ohio sample, the procedure was 
modified somewhat to insure that an adequate number of persons 
tested alone was included in the total sample. Branch offices were 
asked to alternately send the scores of persons tested individually and 
for the next group of persons tested regardless of the number of people 
in the group. Data were also gathered to determine whether any poten- 
tial group size effect might interact with the sex, age, educational level 
or race of the applicant. The GATB is administered with the appli- 
cants seated around a large rectangular table. Thus, they can see and 
hear each other as they work on their individual tests. 


Results 


Since the data from the two samples should have been and were 
similar, they were combined for purposes of data analysis. The 
number of applicants in each group size condition, cell means and 
standard deviations are presented in Table 1. The ANOVA indicated 
a clear group size effect with an upward trend in aptitude scores which 
corresponds with increases in number of coactors in both manual (F = 
4.54, df = 9/2251, p < .001) and finger dexterity (F = 2.88, df =
9/2251, p < .002). Tests for linearity indicated that some predictability
is afforded by the linear rule for both manual and finger dexterity (F =
36.21, df = 1/2251, p < .001, and F = 17.04, df = 1/2251, p < .001,
respectively). The tests for deviation from linearity were nonsignificant 
for both dependent variables. Sex, age, and educational level of the ap- 
plicant did not interact with group size on either the manual or finger 


TABLE 1

Aptitude Test Scores as Related to Group Size

                          Manual Dexterity     Finger Dexterity
Group Size         N       Mean       SD        Mean       SD
     1            146      96.88     23.38      95.48     21.51
     2            125      94.02     24.37      95.72     21.21
     3            219      99.33     23.30      95.93     21.7
     4            292      98.37     23.44      96.08     20.25
     5            290     100.17     23.47      94.97     22.15
     6            320     101.34     25.14      96.31     21.23
     7            277     102.92     21.96      98.36     21.38
     8            212     103.63     22.88      98.83     21.08
     9            120     108.36     21.79     103.27     21.14
    10            260     (scores illegible)
Whole Sample    2,261     100.94     23.31      91.35     21.32


dexterity variables. Race of applicant did interact with group size on
manual dexterity (F = 2.14, df = 9/2231, p < .05). Inspection of the
relevant means indicated that whites exhibited more of the summation
social facilitation effect. This interaction was not significant for finger
dexterity.
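
The linear component reported above can be illustrated roughly from the group means alone. The sketch below is not the authors' ANOVA; it fits a least-squares line through the manual dexterity means, weighting each mean by its N. It assumes that the nine legible rows of Table 1 correspond to group sizes 1 through 9, which is an inference from the layout of the table rather than something the text states, and the function name is hypothetical.

```python
# Manual dexterity means from Table 1 as (group size, N, mean).
# Assumption: the nine legible rows are group sizes 1 through 9.
rows = [
    (1, 146, 96.88), (2, 125, 94.02), (3, 219, 99.33),
    (4, 292, 98.37), (5, 290, 100.17), (6, 320, 101.34),
    (7, 277, 102.92), (8, 212, 103.63), (9, 120, 108.36),
]


def weighted_slope(rows):
    """Slope of the least-squares line through the group means, weighted by N."""
    total_n = sum(n for _, n, _ in rows)
    mean_x = sum(n * x for x, n, _ in rows) / total_n
    mean_y = sum(n * y for _, n, y in rows) / total_n
    sxy = sum(n * (x - mean_x) * (y - mean_y) for x, n, y in rows)
    sxx = sum(n * (x - mean_x) ** 2 for x, n, _ in rows)
    return sxy / sxx


slope = weighted_slope(rows)
print(f"{slope:.2f} score points per additional coactor")
```

A positive slope under these assumptions agrees in direction with the upward trend described in the text; it says nothing about statistical significance.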


Discussion 


These data from an ongoing field setting support predictions consis-
tent with social facilitation theory about the effects of coaction. On
simple manual and finger dexterity tasks, subjects performed at a
higher level when they coacted in larger groups. The relationship
between group size and performance appears to be somewhat linear
with the tested group sizes (however, there does seem to be a downturn
in groups of ten). On first inspection, this appears to be contrary to
Brenner's finding that with group sizes of 0, 2, 8, and 22 spectators,
subjects' arousal increased as a power function. However, there are
several important differences between his study and the present one.
This study's range of group sizes was more limited albeit there was
complete sampling within that range. In addition, Brenner's social
facilitation situation was an audience type which focused on the
mediating mechanism of arousal as opposed to the present coaction
situation which examined performance. Since the race × group size in-
teraction was only significant on one dependent variable, no concep-
tual explanation is proposed.

Moreover, these data are important not only because of their
relevance to social facilitation theory, but also because they provide
empirical support for Guion's (1965) contention that group as op-
posed to individual testing may appreciably change the character of a
test. The differences across disparate group sizes may be of suffi-
cient magnitude to warrant attention in interpreting scores. This might
be particularly important with GATB-type instruments in view of the
fact that norms are established with the multiple cutoff method.
An individual is considered qualified for a particular job only if he
meets the minimum score on each of the key aptitudes (Manual for
the General Aptitude Test Battery, 1967).
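
The multiple cutoff rule can be sketched in a few lines. The cutoffs, aptitude names, and applicant scores below are hypothetical values chosen only for illustration; they are not GATB norms.

```python
def qualifies(scores, cutoffs):
    """Multiple cutoff rule: every key aptitude must meet its minimum score."""
    return all(scores[aptitude] >= minimum for aptitude, minimum in cutoffs.items())


# Hypothetical cutoffs and applicant scores, for illustration only.
cutoffs = {"manual_dexterity": 100, "finger_dexterity": 95}
applicant = {"manual_dexterity": 99, "finger_dexterity": 110}

print(qualifies(applicant, cutoffs))
```

Because a single score below its cutoff disqualifies the applicant regardless of the other scores, a shift of even a few points on one aptitude can change the decision for a borderline case.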
HIGHEST ENTRY HIERARCHICAL CLUSTERING


LOUIS L. McQUITTY and VALERIE L. KOCH 
University of Miami 


This paper develops and illustrates the most rapid method yet re-
ported for hierarchically clustering the n objects of a matrix which
portrays the interrelation of every object to every other object, where
n equals any number up to 1000 and even larger. Results compare
favorably with those from other excellent methods.


CONCENTRATED Hierarchical Clustering is a rapid and simple
method of clustering large numbers of objects (1,000 or more) based
on matrices of interassociations between the objects (McQuitty and
Koch, 1975). The method of this paper was designed to be even
simpler and faster.


Assumptions in Relation to Time and Effort in Analysis 


Both methods use the concept of a reciprocal pair of objects. Two
objects, i and j, in a matrix of interassociations between objects, are
reciprocal if, and only if, Object i is highest in Column j and Object j is
highest in Column i.

Concentrated hierarchical clustering assumes that a reciprocal pair
of objects is indicative of a cluster. The current method applies this as-
sumption to the fact that the highest entry in a matrix is reciprocal; if
Entry ij is highest in a matrix, Object i is highest in Column j and Object
j is highest in Column i. Consequently, the current method assumes
that the highest entry in a matrix is indicative of a cluster.
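
The definition of a reciprocal pair, and the observation that the highest entry always marks one, can be checked with a short sketch. The four-object matrix and the function names below are hypothetical; the matrix is assumed to be symmetric with the diagonal ignored.

```python
def column_highest(matrix, j):
    """Row index of the highest off-diagonal entry in column j (first one on ties)."""
    rows = [i for i in range(len(matrix)) if i != j]
    return max(rows, key=lambda i: matrix[i][j])


def reciprocal_pairs(matrix):
    """Pairs (i, j), i < j, where i is highest in column j and j is highest in column i."""
    n = len(matrix)
    best = [column_highest(matrix, j) for j in range(n)]
    return [(best[j], j) for j in range(n) if best[j] < j and best[best[j]] == j]


def highest_entry(matrix):
    """The pair of objects between which the highest entry in the matrix mediates."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return max(pairs, key=lambda p: matrix[p[0]][p[1]])


# Hypothetical symmetric matrix of interassociations among four objects.
m = [
    [0, 9, 2, 3],
    [9, 0, 4, 1],
    [2, 4, 0, 7],
    [3, 1, 7, 0],
]

print(reciprocal_pairs(m))
print(highest_entry(m))
```

Finding every reciprocal pair requires a pass over all columns, whereas the highest entry is a single pair, which is the saving the current method relies on.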

An operational difference in the two methods pertains to the fact
that a matrix can have more than one reciprocal pair. The earlier
method examines the highest entry within every column to determine
if it is reciprocal. This takes time and effort. The current method
makes a classification (or classifications in the case of a tie) on the
basis of the highest entry only. It reduces the matrix by one object for 
each pair of objects classified and repeats the operation in terms of the 
highest entry in the reduced matrix. 

The current method saves time and effort with respect to another 
assumption. When the two objects of a reciprocal pair are classified 
together, only one of them is removed from the matrix (in order to 
reduce the size of the matrix), and the other one is retained in the 
matrix in order that other objects can join those already classified. 

The former method assumes that Object i of the reciprocal pair i j is 
the better representative of the initial cluster for retention in the matrix 
if and only if Object i exceeds Object j in being most like other objects 
remaining in the matrix; it is most like more other objects than is Ob- 
ject j. 

To determine which of two objects, i or j, meets the above criterion 
takes time and effort. With an emphasis on huge matrices and speed of 
analysis, the current method assumes that either member of the 
highest entry is an adequate representative of the reciprocal pair for 
the purpose of classifying other objects with it. One of them is chosen 
by chance. 

The above two alternative assumptions (one for the current method 
and the other for the former method), about which member to retain 
in the matrix, were investigated in the earlier study. The results sub- 
stantiated the adequacy of the assumption which is being applied in 
the current method (McQuitty and Koch, 1975). 


Definitions Which Generate the Methods 


The earlier method was developed out of a relatively permissive defi-
nition of types. A type is a category of objects of such a nature that
every object in the category is reciprocal with one or more other ob-
jects in the category. Object i is reciprocal with Object j if Object i is
most like Object j and Object j is in turn most like Object i.

The current method will be generated out of a still more permissive
definition of types. A type is a category of objects of such a nature that
every object in the category is most like one or more other objects in 


the category; i.e., in terms of the objects remaining in the matrix at the 
time the object is classified. 


Restrictive versus Permissive Definitions 


Whether exacting or permissive definitions should be applied in 
developing methods for the analysis of data depends on the purpose of 
the analysis. If a primary purpose is to discard irrelevant data, then a
restrictive definition is desirable. Excellent examples are the definitions
by Thurstone (1927a, 1927b, 1927c, 1927d, 1928, 1929, and 1932) in 
the development of linear scales. However, restrictive definitions as- 
sume detailed knowledge as to the manner in which data are in fact in- 
terrelated (or need to be interrelated for some justifiable purpose). 

If the purpose is to discover through analysis the fashion in which a 
set of data is in fact interrelated, then initial methods of analyses 
usually need to be developed out of relatively permissive definitions; 
otherwise the definitions might be restrictive in such a fashion as to 
preclude discovering the nature of the interrelationships in the data. 
The definitions can be made more and more restrictive as applications 
of the permissive definitions yield insights into the actual concatena- 
tions in the data. 

If application of a restrictive definition yields in some way 
questionable results, reversion to a more permissive definition might 
be helpful. The investigator could then move gradually to more and 
more restrictive definitions as greater and greater insights into the 
nature of the interrelationships in the data are realized. 

From a societal point of view there is a more fundamental issue: the 
assumption of a particular kind of relationship may foster its develop- 
ment. The assumption of linear relationships in the assessment of 
educational achievement, for example, may cause individual
[illegible] in intelligence to be more nearly linear than it otherwise
would be.


The Method 
General 


The highest entry in every column of a matrix is identified, and from
amongst them the highest entry in the matrix is identified. The two ob-
jects between which the highest entry mediates are classified together.
The row and column of one of them is selected by chance and removed
from the matrix. The highest entry in the reduced matrix is identified, and
the objects between which it mediates are classified together. One of
them is chosen by chance and its column and row are removed from
the matrix. The process is repeated until all objects have been clas-
sified. Every time an object, i, is classified with an object, j, each of
them takes with it into the new classification all of the objects with
which each has already been classified (if any).
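
The general procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' program: the function name, the `rng` parameter, and the example matrix are hypothetical, the matrix is assumed symmetric, ties are resolved by simply taking the first highest pair encountered, and the highest entry is found by scanning all remaining pairs rather than by first marking each column.

```python
import random


def highest_entry_clustering(matrix, labels, rng=None):
    """Classify the pair with the highest entry, drop one member by chance, repeat.

    Returns a list of (score, merged_cluster) in the order classifications occur.
    """
    rng = rng or random.Random()
    active = list(range(len(labels)))            # objects still in the reduced matrix
    cluster = {i: {labels[i]} for i in active}   # what each object is classified with
    history = []
    while len(active) > 1:
        # Highest entry among the objects remaining in the matrix.
        i, j = max(
            ((a, b) for k, a in enumerate(active) for b in active[k + 1:]),
            key=lambda p: matrix[p[0]][p[1]],
        )
        merged = cluster[i] | cluster[j]         # each brings its earlier classifications
        keep, drop = (i, j) if rng.random() < 0.5 else (j, i)
        cluster[keep] = merged
        del cluster[drop]
        active.remove(drop)                      # remove the row and column of one object
        history.append((matrix[i][j], sorted(merged)))
    return history


# Hypothetical symmetric matrix of interassociations among four objects.
m = [
    [0, 9, 2, 3],
    [9, 0, 4, 1],
    [2, 4, 0, 7],
    [3, 1, 7, 0],
]
history = highest_entry_clustering(m, ["A", "B", "C", "D"], random.Random(0))
for score, members in history:
    print(score, members)
```

In this example the first two classifications are the same whichever member is retained; only the score at which the two clusters finally join depends on the chance choices.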




An Illustration


The method is illustrated with a matrix which is difficult to analyze
by some methods because it yields many ties by some methods (Mc-
Quitty, Price, and Clark, 1968). This matrix was used in illustrating
Concentrated Hierarchical Analysis, mentioned above.

The matrix is shown in Table 1. The highest entry in every column is 
underlined. If there is a tie for the highest entry in a column, all tied 
entries are underlined, as in Column C, where the highest entry is 34. 

The highest entry in the matrix is isolated. It is 34, and appears in
Columns C, F, and P. The first column and the first row in which the
highest entry occurs are designated i and retained, and the other
row(s) and column(s) in which it occurs are designated j and removed
from further analysis. Chance retention and removal, in this approach,
is achieved by using chance to assign the objects to their positions in
the matrix. The first row and column, C, in which 34 occurs is
designated i₁, and the next two rows and columns, F and P, in which it
occurs are each designated j₁. Using the new designation, i₁ is most like
j₁, and j₁ is most like i₁, irrespective of which j₁ is intended. The
columns and rows for both j₁'s were removed by crossing them out in
the matrix. The lines which mark out Rows F and P were terminated
in the column of Step 1 to show that they were removed in Step 1. This
action completed Step 1, as shown in the table.
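
One reading of this tie rule can be sketched as a single step. The sketch assumes that every pair sharing the highest entry is classified, that the object appearing earlier in the matrix is retained as i, and that the later one is removed as j. The function name and the matrix entries are hypothetical (the values are not taken from Table 1, and object X is invented for the example).

```python
def tied_step(matrix, labels, active):
    """One step with ties: classify every pair that shares the highest entry.

    `active` lists the remaining object indices in matrix order. For each tied
    pair the earlier object is retained (i) and the later one is removed (j).
    Returns (score, tied pairs as label tuples, remaining indices).
    """
    pairs = [(a, b) for k, a in enumerate(active) for b in active[k + 1:]]
    top = max(matrix[a][b] for a, b in pairs)
    tied = [(a, b) for a, b in pairs if matrix[a][b] == top]
    removed = {b for _, b in tied}
    remaining = [a for a in active if a not in removed]
    named = [(labels[a], labels[b]) for a, b in tied]
    return top, named, remaining


# Hypothetical entries: C ties at 34 with both F and P, as in Step 1.
labels = ["C", "F", "P", "X"]
m = [
    [0, 34, 34, 20],
    [34, 0, 30, 10],
    [34, 30, 0, 15],
    [20, 10, 15, 0],
]
score, named, remaining = tied_step(m, labels, [0, 1, 2, 3])
print(score, named, [labels[a] for a in remaining])
```

Because the objects were assigned to their matrix positions by chance, retaining the earlier member of each tied pair amounts to a chance choice, which is how the text describes the selection.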

Step 2 is a repetition of Step 1, just outlined, except that it applies to
the reduced matrix. The highest entry in each column was underlined.
There were changes only to the extent that Rows j₁ (F and P), which
were removed, thereby changed the highest entries in some retained
columns; the removal of Row F, for example, removed some of the
entries in Columns C, I, K, and S, and the new highest entries had to
be identified and underlined (except for Column I which had three
highest entries and only two of them were removed).

The highest entry in the reduced matrix was identified. It is 33 in
Row C, Column I and Row I, Column C. Row and Column I were
designated j₂ and removed by marking them out in the matrix. Row
and Column C were designated i₂ and retained in the matrix.

No further operations were performed on Table 1. Otherwise to fol-
low the above description would have been difficult. The complete
analysis is shown in Table 2.


The removal of Row I removed the highest entries in Columns C
and M. The highest entries in the reduced matrix for these columns
were underlined. They are 32 in Row Q, Column C and 28 in Row
A, Column M.

In Step 3, the highest entry in the reduced matrix is 32 in Row C,
Column Q and Row Q, Column C. Row and Column Q were
designated j₃ and removed by marking them out, and Row and
Column C were designated i₃ and retained.


Step 4 contains a new kind of tie. The highest entry in the reduced
matrix is 31; it mediates between Objects (1) C and N, (2) J and T, and
(3) K and S. In the previous tie, one object, C, was involved in all of
the tied values. Objects C, J, and K were designated i₄, i₄', and i₄'',
respectively, and retained. Objects N, T, and S were designated j₄, j₄',
and j₄'', respectively, and removed by marking them out in their rows
and columns.

Step 5 shows 30 to be the highest entry in the reduced matrix. It
mediates between C and K on the one hand and between G and J on
the other, with C and G thus designated i₅ and i₅', and retained, and K
and J designated j₅ and j₅', and removed.

Step 6 yields A and C highest with an entry of 29; A was designated
i₆ and retained. C was designated j₆ and removed.

Step 7 yields A and M highest with an entry of 28; A was designated
i₇ and retained, and M was designated j₇ and removed.

Step 8 yields B and O and D and L with the highest entry of 27. B
and D were designated i₈ and i₈', respectively, and retained. O and L
were designated j₈ and j₈', respectively, and removed.

Step 9 yields B and G with the highest entry of 26. B was designated
i₉ and retained, and G was designated j₉ and removed.

Step 10 yields the highest entry of 25 for AE on the one hand and
BD on the other. A and B were designated i₁₀ and i₁₀', respectively,
and retained. E and D were designated j₁₀ and j₁₀', respectively, and
removed.

Step 11 yields the highest entry of 20 for A with B, B with H, and B
with R. A was designated i₁₁. B was first designated j₁₁ and then i₁₁. H


The Hierarchical Structure 


The classification structure is shown in Figure 1. It can be con-
structed from either Table 2 or the description in the steps of the
analysis as outlined above. Step 1, for example, classified C with each
of F and P under a score of 34. An asterisk was placed under each of C
and F in Step 1 of Figure 1 to show that C joined F in this step, and an
asterisk with a prime was placed under each of C and P to show that they
joined one another in this step. The prime shows that a tie occurred;
the two objects with the asterisk and prime tied with the two objects
with the asterisk only. In case of three pairs in a tie, the members of
the third pair are designated by an asterisk with two primes as shown
in Step 4.

Step 2 classified I and C together under a score of 33, and C, in ac-
cordance with the prescribed procedure, took with it into the
hierarchical structure the objects (F and P) with which it had already
been classified.

In the course of an analysis, a classification sometimes occurs 
between two objects which have not previously been classified. In cases 
like this their initial portrayal is on a separate sheet from those objects 
already classified. They are later joined to another structure if, and 
only if, one of their objects classifies with an object of another struc- 
ture. 

Building the classification structure proceeded in the above fashion 
until all objects were classified from the right hand side of the structure 
to the left hand side according to the size of the scores which joined 
them to the initial structure (except for ties); the order is arbitrary for
tied scores.


Proof of the Method 


Let any three objects, i, j, and k, be so chosen that they are directly 
associated in a classification resulting from the above kind of analysis: 
i is highest with j, j is highest with i; j is also highest with k, and k is 
highest with j. By definition, i and j belong to the same type, and 
likewise j and k belong to the same type. Therefore, i, j, and k belong 
to the same type, and all objects associated in a cluster belong to the 
same type.

An exception to the above conclusion could derive from a classi- 
fication based on only two objects in a matrix. Consequently, this kind 
of a classification has no inherent validity. An exception could not 
derive in any other way. However, the validity of classifications
generally becomes less as the size of the matrices on which they are
based decreases and as the size of the indices of association decreases.


Expediting the Method 


The method can be expedited by removing irrelevant data. Irrele- 
vant data are those entries which are not used in classifying objects. 
Which data are irrelevant can be estimated, initially. After the
highest entry in every column has been isolated, the lowest of these is
determined and labeled L. The L entry of the current data is 23 and oc-
curs in Column H, as shown in Table 1. A tentative estimate is made
that all entries lower than L are irrelevant. They are removed from the
analysis for the time being.

The reduced set of data is then analyzed in the fashion outlined 
above for the complete matrix. The analysis continues as long as the 
method yields classifications. However, the method may discontinue 
classifications prior to all objects having been classified. 

The discontinuance of classification for the reduced data of the cur- 
rent study would have occurred at the end of Step 10. This is because 
the classification of the next step for the complete matrix was based on 
a score of 20 (which mediates between A and B, B and H, and B and
R) and the lowest entry retained under the above criterion was 23. 

Objects A, B, H, and R were the only objects left unclassified at the 
end of Step 10. All of them had earlier had entries of 23 or above. Ob- 
ject A, for example, had had an entry of 30 with Object P, but this
entry was removed when Object P was removed from the matrix in
Step 1.

In order to complete the analysis, the unclassified objects are reas- 
sembled in a matrix of only them (Table 3 for the current data) and the 
expedited method is applied to them in the same fashion as it was to 
the original matrix. The L entry for the reduced matrix is 20. It occurs
in every column and, of necessity, classifies the remaining objects in
the same fashion as the original method when applied to all of the
data.
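
The first stage of the expedited method (isolate the highest entry in every column, take the lowest of these as L, and set aside every entry below L) can be sketched as follows. This is a minimal illustration rather than the author's program: the function name `expedite` and the dictionary-of-rows layout are assumptions, and the data are the four-object matrix reported as Table 3.

```python
def expedite(matrix):
    """Return (L, reduced) for a symmetric matrix given as {row: {col: score}}.

    L is the lowest of the column maxima. Entries below L are treated as
    tentatively irrelevant and dropped from the reduced matrix.
    """
    labels = list(matrix)
    # Highest entry in every column (the diagonal is simply absent).
    column_max = {
        c: max(matrix[r][c] for r in labels if r != c) for c in labels
    }
    L = min(column_max.values())
    reduced = {
        r: {c: s for c, s in matrix[r].items() if s >= L} for r in labels
    }
    return L, reduced


# Table 3: the objects left unclassified at the end of Step 10.
table3 = {
    "A": {"B": 20, "H": 13, "R": 16},
    "B": {"A": 20, "H": 20, "R": 20},
    "H": {"A": 13, "B": 20, "R": 15},
    "R": {"A": 16, "B": 20, "H": 15},
}

L, reduced = expedite(table3)
print(L)             # 20, the highest entry of every column
print(reduced["A"])  # {'B': 20}
```

For these data every column maximum is 20, so L is 20 and only the entries linking B to A, H, and R survive, in agreement with the statement that the L entry occurs in every column.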

Additional reduced matrices could have been realized in the above 
fashion if required for completing the analysis. 

One could make a more conservative estimate for removing irrele- 
vant scores from the initial matrix and thereby avoid the necessity of 
more than one reduced set of data. In the above case, if the L entry for
removing entries had been reduced by 13 percent, from 23 to 20, all
scores of 20 or above would have been retained and the first reduced
set of data would have been sufficient for completing the analysis.


Enhancing the Capacity of the Method 


Description. The size of a matrix which can be analyzed can be 
enhanced if the original matrix can be partitioned into two or more 
overlapping, but of course smaller, sub-matrices, which are chosen in
such a fashion that when they are analyzed separately and the results
combined they give the same classifications which would have been 
obtained if the original matrix could have been analyzed without par- 
titioning. 

TABLE 3
Illustrating Analysis after Satisfying the First Criterion

      A    B    H    R
A         20   13   16
B    20        20   20
H    13   20        15
R    16   20   15


Classification by the complete method (applied to the original, com-
plete matrix) utilizes a numerical criterion which is decreased every
time a classification is completed (i.e., except for ties). The original
criterion is the highest entry in the original matrix. The two objects
between which it mediates are classified together. One of them is
removed from the matrix and the other is retained. The criterion is
reduced to the highest entry in the reduced matrix, and the process is
repeated. The entire process is repeated until all objects are classified.
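
The loop just described can be sketched as follows. This is a hedged illustration, not the author's program: the function name and the data layout are assumptions, and so is the tie rule, under which all pairs tied at the current criterion are classified in the same step and one object (here the alphabetically first) is retained from each group of linked pairs. The paper states only that the retained member of a pair is selected randomly. The data are those of Table 3.

```python
def highest_entry_classification(scores):
    """Classify objects by repeatedly taking the highest remaining entry.

    scores maps frozenset({a, b}) -> index of association.
    Returns one (criterion, tied_pairs, removed_objects) tuple per step.
    """
    remaining = set().union(*scores)
    steps = []
    while len(remaining) > 1:
        live = {p: s for p, s in scores.items() if p <= remaining}
        if not live:
            break
        criterion = max(live.values())
        tied = sorted(
            tuple(sorted(p)) for p, s in live.items() if s == criterion
        )
        # Merge tied pairs that share an object into one group.
        groups = []
        for a, b in tied:
            touching = [g for g in groups if a in g or b in g]
            merged = {a, b}.union(*touching)
            groups = [g for g in groups if g not in touching] + [merged]
        removed = set()
        for group in groups:
            retained = min(group)  # assumption: keep alphabetically first
            removed |= group - {retained}
        steps.append((criterion, tied, sorted(removed)))
        remaining -= removed
    return steps


table3 = {
    frozenset("AB"): 20, frozenset("AH"): 13, frozenset("AR"): 16,
    frozenset("BH"): 20, frozenset("BR"): 20, frozenset("HR"): 15,
}
steps = highest_entry_classification(table3)
print(steps)
# [(20, [('A', 'B'), ('B', 'H'), ('B', 'R')], ['B', 'H', 'R'])]
```

With the Table 3 data a single step at a criterion of 20 classifies all four objects and retains A, which corresponds to Step 11 of the worked example.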
The above facts can be utilized to partition a matrix in such a
fashion that when the analysis is applied to the submatrices it will
produce the same results as applying the analysis to the unpartitioned
matrix.

The highest entry of every column of the original matrix is un-
derlined. In the case of two or more entries tying for highest in a
column, all of them are underlined. The highest entry for each column
is listed at the bottom of the column, thus forming a bottom row of
highest column entries. The entries of this row are assigned ranks, with
the largest of these entries having a rank of one, the next largest a rank
of two, etc. In the case of a tie, the highest numerical rank involved in
the tied entries is assigned to all of the tied entries. For example, if the
highest entry in the row is 86, followed by 84, 80, 80, 80, 80, 80, and 75,
the assigned ranks are 1, 2, 7, 7, 7, 7, 7, and 8 respectively.

The ranks assigned to the entries in the bottom row are assigned to all
corresponding values which are underlined in the body of the original
matrix, but not to corresponding values which are not underlined. For
example, if the highest entry of Column 1 is 86, of Column 2 is 84, of
Column 3 are 80 and 80, of Column 4 are 80, 80, and 80, and of
Column 5 is 75, these numbers would be underlined, and would be as-
signed ranks of 1, 2, 7, 7, 7, 7, 7, and 8 respectively, but if Column 1
also had entries of 84, 80, and 75, they would not be assigned ranks
because they would not be underlined as highest entries in Column 1.
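
The tie rule for ranks (every tied entry receives the highest numerical rank involved in the tie) amounts to counting how many entries are at least as large as the one being ranked. A minimal sketch, with a hypothetical function name, applied to the example values given in the text:

```python
def tied_high_ranks(entries):
    """Rank 1 goes to the largest entry; tied entries all receive the
    highest numerical rank involved in the tie."""
    return [sum(1 for other in entries if other >= value) for value in entries]


highest_entries = [86, 84, 80, 80, 80, 80, 80, 75]
print(tied_high_ranks(highest_entries))  # [1, 2, 7, 7, 7, 7, 7, 8]
```

The five entries of 80 would occupy ranks 3 through 7, so each receives rank 7, and 75 follows with rank 8.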
Assume that X equals the number of objects in the largest matrix
which is practical or desirable for some reason to analyze (such as
limitation of available facilities). The above ranks can be used to select
submatrices of size X or slightly smaller from a larger matrix. In those
cases in which the value X occurs as a rank for highest column entries,
the criterion for the number of objects to be included in the first sub-
matrix is X. In those cases in which X does not occur as a rank for
highest column entries (because of ties) the criterion for the number of
objects to be included in the first submatrix is reduced to the next
smaller rank, numerically, which does occur. Let the criterion
(whether it be X or the next smaller rank which occurs) be desig-
nated C. All objects having highest column entries with ranks of C or
smaller, numerically, are selected for the first submatrix.
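
The choice of the criterion C and of the objects for the first submatrix can be sketched as follows. The function name, the object labels `obj1` to `obj8`, and the simplification of one rank per object (the rank of its highest column entry) are assumptions made for illustration; the ranks themselves are those of the example in the text.

```python
def select_first_submatrix(rank_by_object, x):
    """Return (C, selected objects) for a capacity of x objects.

    C is x when x occurs as a rank, otherwise the next smaller rank
    that does occur. Objects with ranks of C or smaller are selected.
    """
    usable = [rank for rank in rank_by_object.values() if rank <= x]
    if not usable:
        raise ValueError("no rank is as small as the capacity x")
    c = max(usable)
    selected = sorted(o for o, rank in rank_by_object.items() if rank <= c)
    return c, selected


# Hypothetical object labels carrying the ranks 1, 2, 7, 7, 7, 7, 7, 8.
ranks = dict(zip(
    ["obj1", "obj2", "obj3", "obj4", "obj5", "obj6", "obj7", "obj8"],
    [1, 2, 7, 7, 7, 7, 7, 8],
))

print(select_first_submatrix(ranks, 5))  # (2, ['obj1', 'obj2'])
print(select_first_submatrix(ranks, 7)[0])  # 7
```

With a capacity of five, rank 5 does not occur because of the tie, so C falls back to 2 and only two objects enter the first submatrix. This shows how ties can make the submatrix noticeably smaller than X.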

The first submatrix is analyzed by the first stage of the method out- 
lined above for expediting the method. The smallest underlined entry 
(for being highest in a column) is designated L, and all entries lower 
than it are removed from the submatrix. The analysis of the first sub- 
matrix is completed when all of the remaining entries have been used 
in classifying objects or have been removed as other objects were clas- 
sified and removed from the matrix (along with their entries). 

When the analysis has proceeded to the above stage (where all 
entries of size L or above have been used or removed), at least one ob- 
ject will remain in the first submatrix. It is one of the two objects which 
participated in the last classification; whenever a pair of objects is clas- 
sified, one of them is removed and the other is retained in the sub- 
matrix (until all objects of the original matrix are classified). 

Other objects may also remain in the first submatrix at the above 
stage. They are the objects which had all of their entries of L or above 
removed from the first submatrix by virtue of the entries being with 
objects which were classified and removed from the submatrix. 

All objects left in the first submatrix at the above stage are trans- 
ferred back with those other objects which were not selected for the 
first submatrix. These objects constitute a matrix of unclassified objects. If
the number of objects is still too large for the available facilities, a
second submatrix is drawn and analyzed as outlined above for the first
submatrix. The process of selecting and classifying by submatrices can
be continued until the number of remaining objects is within the limits
of the facilities available. They are then classified without further
reduction to a submatrix.

Proof. When the first submatrix discontinues to yield classifi-
cations, all objects which could have yielded a classification [...]
were available and were classified in exactly the
same fashion as if the entire matrix had been analyzed. When a new
criterion is specified for the next submatrix, all of the objects for
[...] the new criterion are available and in the same
[...] the complete matrix approach. These latter facts apply
to subsequent submatrices. The analysis of partitioned matrices
produces the same results as the analysis of the original unpartitioned
matrix.


Evaluating the Method


The method is evaluated by comparing it with Concentrated 
Hierarchical Classification which has already been compared with two 
other versions and one other method and found to be very promising 
(McQuitty and Koch, 1975). 


Speed and Capacity 


The Highest Entry method is generally more rapid and can analyze 
larger matrices than the earlier method. This is because it omits certain 
steps: (a) it does not search for all reciprocal pairs; it uses, instead, 
only the highest entry in a matrix, and (b) it does not apply a criterion 
to determine which member of a classified pair should be retained in 
the matrix for further analyses; it selects one randomly. Empirical
evidence that the current method is faster is given by the fact that it re- 
quired only one more step than the earlier method for analyzing the 
common set of data used in this and the earlier study, and the steps are 
executed much more rapidly in the current method. Furthermore, the 
current method has been improved by a technique for eliminating irrele-
vant data and by matrix partitioning to increase further its capacity.


Reliability and Validity 


A method of classification is generally assumed to be reliable to the 
extent that its classifications are based on relatively high scores. 

Table 4 shows the frequencies and accumulated frequencies of the
scores at which classifications were realized in the two methods. Both
methods required 19 classifications. The size of the scores at which
classifications occurred is identical for the two methods for the first 11
classifications, favors the earlier method slightly for the next six clas-
sifications, and favors the current method to a greater extent for the
last two classifications. The means of the scores at which the objects
classify are 28.05 for the current method and 27.94 for the earlier
method. These comparisons indicate the methods to be similar with
respect to reliability of their classifications.


TABLE 4
Frequencies and Accumulated Frequencies of Scores at Which Classifications Occurred

Score  34 33 32 31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13

[Frequencies are not legible in the source.]


TABLE 5
Order of Classification of Objects

Position         1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Earlier Method   C  F  P  I  Q  N  K  S  A  M  E  J  T  G  O  R  D  L  B  H
Current Method   C  F  P  I  Q  N  K  S  A  M  E  J  T  G  O  B  D  L  R  H
Differs                                                       *        *

* = positions at which the two methods classify objects differently.

[Cluster-order numbers (1, 2, 3 for C, F, and P; 1, 2 for the two-object clusters) are not legible in the source.]

In studying the validity of the current method, the hierarchical
structure from the former study is reproduced in Figure 2 for com-
parison with that from the current study for the common data of the
two studies. The comparison is summarized in Table 5, where the ob-
jects are listed from left to right for each method to show the order in
which they classified. The order is indeterminable by each method for
objects C, F, and P. That they form a cluster of indeterminate order
amongst themselves in each method is indicated by the numbers 1, 2,
and 3. All other clusters of this kind were for only two objects and are
indicated by the numbers 1 and 2 associated with the objects of such
clusters. The earlier method produced three clusters of this kind, com-
pared with five for the current method.

In the above cases where the orders were undetermined by the
method, the members of the clusters were arranged to minimize the
discrepancy between the orders obtained by the two methods. There
were differences between the two methods in the position classification
of only two objects, as shown in Positions 16 and 19, where B and
R in the current method are each three positions removed from where they
are in the earlier method. The greater speed and capacity of the current
method probably compensate in most sets of data for errors (if any)
which might result from the current method.


Conclusions 




Highest Entry Hierarchical Classification is the most rapid method 
thus far developed for hierarchical clustering of matrices of inter-
associations between objects of all sizes up to 1,000 × 1,000 and even
larger, and it compares favorably with the best of methods in terms of
reliability and validity.
Q FACTOR ANALYSIS: APPLICATIONS TO
EDUCATIONAL TESTING AND PROGRAM EVALUATION 


F. STEVENS REDBURN 
Youngstown State University 


The potential use of Q factor analysis for evaluating and designing 
educational and clinical programs is discussed and illustrated. The 
advantages of this technique in comparison to normative measure- 
ment employing prevalidated or a priori scales include additional 
richness of theoretical insight, discovery of the structure as well as
the content of the individual's thinking, relative independence from 
prior conceptualization, and efficiency in gathering detailed informa- 
tion quickly. Q factor analysis is most appropriate for use in clinical 
or educational situations where available typologies or scales seem 
inadequate, where the psychological dynamics of learning or treat- 
ment are not well understood, or where it is desirable to avoid an- 
ticipating the precise direction and character of program impact. 
Several practical applications of the technique are suggested. 


VIRTUALLY all educational measurement and testing focuses on nor-
mative comparisons or the movement of subjects relative to
prevalidated or a priori scales. Although the alternative perspective
offered by Q methodology, and especially by Q factor analysis, is well
known (Stephenson, 1953; Kerlinger, 1964), its potential for use in
evaluating and designing a great variety of educational and clinical
programs has been neglected.

The following brief description of a study of perceptions and at-
titudes held by a group of college urban interns and changes in their
patterns of thinking following the internship will suggest various uses
of the Q factor analysis methodology. Although the findings of the
study have been quite provocative to the faculty associated with this
program, the focus here is not on the substantive results but on the
potential of Q as a testing procedure that, while possessing limitations
of its own, avoids certain pitfalls associated with traditional scaling
approaches.


Copyright © 1975 by Frederic Kuder
EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


Rationale for the Use of Q Factor Analysis to 
Measure Educational Program Impacts 


In Q methodology, subjects sort (i.e., rank-order) a series of state- 
ments according to an abstract criterion (often an agree-disagree 
dimension). The ordering of statements may then be examined for its 
meaning relative to a theory or relative to other evidence of the in- 
dividual's state of mind. Factor analysis adds to the power of this 
technique by permitting the identification of clusters of individuals 
who have performed similarly. First the Q-sorts of all individuals un- 
der study are intercorrelated. The resulting Q matrix is then factored 
to locate clusters of respondents.? 
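
The first of the two steps named above, intercorrelating the Q-sorts of all individuals, can be sketched as follows. The three five-statement sorts are invented solely for illustration and are not data from the study, and the function names are assumptions. The factoring of the resulting person-by-person matrix is not shown and would ordinarily be left to a statistical package.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def q_matrix(sorts):
    """Correlate every person's Q-sort with every other person's."""
    people = list(sorts)
    return {p: {q: pearson(sorts[p], sorts[q]) for q in people} for p in people}


# Hypothetical sorts: each list holds one person's placement of the same
# five statements on a scale from -2 (most disagree) to +2 (most agree).
sorts = {
    "person_1": [2, 1, 0, -1, -2],
    "person_2": [2, 0, 1, -2, -1],
    "person_3": [-2, -1, 0, 1, 2],
}

q = q_matrix(sorts)
print(round(q["person_1"]["person_2"], 2))  # 0.8
print(round(q["person_1"]["person_3"], 2))  # -1.0
```

Note that the correlations are computed between persons across statements, the reverse of the usual practice of correlating items across persons. Persons 1 and 2 would tend to load on a common factor, with person 3 at its opposite pole.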

The interpretation of the ordering of statements typical of a par- 
ticular factor is essentially a subjective and imaginative enterprise; it is, 
therefore, useful on occasion to have two or more individuals in-
dependently label and describe the factors. Where there are differences
in interpretation, their discussion will often generate additional 
richness of insight, contributing to the value of the technique as a 
means of exploration and discovery. 

It is the latter purpose which Q factor analysis best serves. It is most 
appropriate for use in clinical or educational situations where 
available typologies and scales seem inadequate, where the psy-
chological dynamics of learning or treatment are not well understood,
or where it is desirable to avoid anticipating the precise direction and
character of program impact. In short, this measurement approach is
invited in most if not all small group clinical and educational program
categories.

* In the experiment to be used here for illustration, this is done by sorting 51 state-
ments, each one printed [...] distribution which is the same for each [...]
(most disagree) [...]

Why is Q factor analysis more likely to offer fresh insights into the
behavior of individuals? In part, this is so because it is a highly efficient
technique for quickly gathering detailed information about someone's
thinking in a form that lends itself to quantitative comparisons with
the thought of others. Specifically, anyone performing a Q-sort of n
statements will make, in effect, ½[n(n − 1)] comparisons between pairs
of statements in little more time than is required to respond to n in-
dependent scale items.
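
As a worked instance of this count, taking n = 51 (the number of statements in the Q-sort used in the illustrative study):

```latex
\[
  \tfrac{1}{2}\,n(n-1) \;=\; \tfrac{1}{2}\,(51)(50) \;=\; 1275 .
\]
```

A single sort of 51 statements thus implies 1,275 pairwise comparisons, against 51 responses for the same number of independent scale items.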

The relative rankings of statements often give insight into the struc- 
ture as well as the content of an individual's thinking. In the typical 
testing approach there is a further loss of information (as well as a 
potential gain in information) when a number of items are added 
together to form a scale. In an examination of the completed Q sort, 
on the other hand, statements are not treated as scale items (although 
scales are frequently used to structure а О sample); rather, their 
relative placements are studied in an attempt to penetrate the logic of 
the individual. If this examination follows a factor analysis of several 
sorts, then the ordering examined is actually the weighted average for 
a similarly sorting group of individuals. In either case, unanticipated 
item juxtapositions are likely to reveal logical or psycho-logical 
relationships that would be blotted out by treatment of each subject or 
set of subjects as a cluster of scale scores. It is this relative in-
dependence from a priori categorization that is the primary strength of
Q when dealing with the impacts of ambiguous, complex symbolic ex-
changes. Such transactions are the essence of many educational and
clinical programs.

It must be observed that independence of a priori categorization is
not total in this or any methodological strategy. There are two times in
the Q research process when predefinition imposes limits on discovery.
One is during construction of the Q sample, when statements are
selected according to an implicit or explicit theory. The use of variance
designs in sample construction is not only limiting in the respect sug-
gested but useful in helping to insure ecological representativeness
(Brown, 1970a and 1970b; Brunswik, 1956). The danger of foreclosing
discovery is reduced if one remembers that the use of theory at this
stage is intended mainly to maximize coverage of the attitudinal do-
main of interest; in examination of the completed sort, items should
not be treated simply as components of scales but as statements that
are invested with various meanings by various individuals. For in-
stance, an item referring to the President that measures one in-
dividual's political alienation may be reacted to by a second individual
solely in terms of his feelings toward the incumbent. If items have no
fixed meanings, then their interpretations will only become apparent,
if at all, when viewed in the context of the same individual's responses
to other statements or stimuli. Employing a variance design will
hopefully provide enough of that context to expose that meaning.
The second point at which preconceptions of the researcher stand in
the way of discovery is during interpretation of the completed Q sort 
or the ordering of statements typical of a factor. Just as the same state- 
ment may generate different meanings for different individuals, so may 
the same arrangements or patterns of statements suggest different 
logics to different researchers. If one is reminded that a major value of 
Q is its use as a tool of discovery, then the possibility of multiple in- 
terpretations has a positive as well as a negative side. If, on the other 
hand, one is concerned about validity, i.e., is anxious to identify an in- 
terpretation which is most consistent with a particular theoretical or 
professional perspective or with evidence from other types of measure- 
ment, there are means available for this. The use of two or more in- 
dependent judges has been suggested already. A second device, which 
establishes a matrix for comparisons between subjects and pre- 
established norms, is to include among the Q sorts prior to factor 
analysis sorts that have a previously established theoretical interpreta- 
tion. These may be sorts performed by individuals who are known to 
have a particular orientation or “dummy” sorts composed to conform 
to an abstract type.* In the illustrative use of Q to be discussed in the
following section, the Q sorts of instructors have been included to
provide points of reference for interpretation of students’ thinking. Of
course, there is no device or combination of devices that will remove
the subjectivity inherent in both the design and interpretation of 
educational and psychological research. 

The following example suggests one possible use of Q factor analysis 
as a testing device in educational or clinical programs.


Changes in Cognition, Affect, and Evaluation by
Twenty Urban Interns 


Two classes of urban interns, a total of 20 students, completed a 51-
item Q-sort prior to and at the conclusion of their intern experiences.
The [...] program under study combines 300 hours of employ-
ment with a local public agency, a weekly seminar of three hours, and
[...] of work on an academic project to be used by the employing
agency [...] an intensive, 10 credit-hour exposure that is accorded high
[...] and occasionally leads to permanent employment
[...] a participating public agency. The primary purpose
of the program, as seen by the funding agencies, is to stimulate com-
mitment to careers in urban administration. The course is conducted
jointly by three instructors, each of whom also completed the Q-sort
during the study period.


——————

* [...] individuals for whom an orientation has been previously es-
tablished [...] work of Fred N. Kerlinger on the use of Q factor analysis
to validate a previously identified structure of social attitudes (Kerlinger, 1972).


The Q sample is constructed around 10 dimensions or substantive 
foci, as shown in Table 1. It is better practice for many purposes to 
narrow the attitudinal domain of interest so as to provide more 
detailed information and thereby aid interpretation. The objective 
here is to cast the net broadly in order to learn whether and to what 
degree various cognitive, affective, and evaluative orientations are 
changed by the work of the course. Although a factorial design is not 
formally employed, an attempt has been made to include under each 
focus both realistic and stereotyping statements. 

For purposes of the factor analysis, the “before” and "after" Q 
sorts are treated as performances of separate individuals. One result of 
this is that students who show more stability will have a greater in- 
fluence on the factor structure that emerges. It is also possible to plot, 
in terms of factor loadings, the movement of each student relative to 
the factor structure. A student may, for instance, move from a high 


TABLE 1 
Substantive Foci of the Urban Interns Q Sort and Number of
Statements Devoted to Each 


Number of
Statements
Nature of Local Decision-Making Role 
Sense of Political Efficacy 
Attitudes toward Careers in Urban Administration or Politics 
Elitism vs. Populism 
Perceptions of Local Power Structure 
Perceptions of Public Officials' Integrity 
Perceptions of Public Officials’ Responsiveness 
Estimated Impact of Local Government Policies 
City's Problems and Probability of Solution 


4 
6 
8 
4 
5 
4 
6 
Attitudes toward Local Area 2 


Бов зета 


* Total exceeds 51 due to double counting of three statements. 


—————— 

* A copy of the Q sample may be obtained from the author. Typical statements are:
(1) Local government is a great force for good in most communities; (2) Local govern-
ment is only one of many forces acting on the city and probably one of the least power-
ful in shaping the urban environment; (3) I think that through a career in local govern-
ment I could do a great deal to help my community; and (4) I doubt that I will ever
know enough to play a major role in the public affairs of my community.

So far as can be discovered, the few previous applications of Q factor analysis to
longitudinal analysis with the same set of subjects do not exploit its potential as an in-
strument for detecting the movement of individuals relative to a given factor structure.
See, for instance, John M. Butler's study (1972) which employs repeated Q sorts and fac-
tor analysis to demonstrate that clients of psychotherapy show significant changes in
self-ideal correlations. Butler does not factor together Q-sorts completed at different
stages of therapy. See also Brenner (1972).


loading on factor one prior to the urban internship to a more am-
biguous position in which he displays moderate loadings on factors 
one and three. The interpretation of this movement depends on ex- 
amination of the statement orderings typical of the factors and the use 
of evidence drawn from other sources concerning the student's ex- 
periences. 

The urban internship is apparently a very different experience for 
different individuals. Six factors were identified." On each of these 
anywhere from two to eight students were significantly loaded at 
program entry and from three to seven at program's end.’ The factor 
loadings of students were generally stable, with sixteen students loaded 
on the same factor before and after. Correlations between before and 
after performances were all positive, ranging from +.16 to +.73 with a
mean of +.50. On the other hand, some students who remained loaded 
on one factor showed movement relative to other factors. 
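
The loading criteria given in the footnotes to this section (a loading above .35 counts as significant, and a person is defining for a factor when that criterion is met and all other loadings are at least .15 lower than the highest) can be sketched as follows. The function name and the three sets of loadings are hypothetical illustrations, not values from the study, and negative loadings are ignored for simplicity.

```python
def classify_person(loadings, significant=0.35, margin=0.15):
    """Return (factors with significant loadings, defining factor or None).

    Factors are numbered from 1 in the order the loadings are given.
    """
    significant_on = [
        k + 1 for k, value in enumerate(loadings) if value > significant
    ]
    top = max(range(len(loadings)), key=lambda k: loadings[k])
    clear_of_others = all(
        loadings[top] - value >= margin
        for k, value in enumerate(loadings)
        if k != top
    )
    is_defining = loadings[top] > significant and clear_of_others
    return significant_on, (top + 1 if is_defining else None)


print(classify_person([0.62, 0.20, 0.10]))  # ([1], 1)
print(classify_person([0.45, 0.38, 0.05]))  # ([1, 2], None)
print(classify_person([0.30, 0.10, 0.05]))  # ([], None)
```

The second case corresponds to the "more ambiguous position" described earlier: the person is significantly loaded on two factors but is defining for neither, and so would not enter the weighted average typical of either factor.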

A few students produced strikingly different performances on the 
two sorts. In particular, two students who were initially loaded on the
first factor dropped off this factor and two students moved onto this 
factor. 

FACTOR ONE: Because space does not permit full discussion of 
the substantive results, analysis will be confined to the most heavily 
populated of the six factors. This factor is of interest also because of 
the changes in student loadings referred to above and because the 
ordering of statements by this group corresponds closely to a frame of 
mind consistent with what was identified above as a principal objective 
of the internship program. 

Table 2 lists the four statements most strongly agreed with and the 
four statements most strongly disagreed with by the typical member of
this group, excluding one statement on regionalism that received high
scores on all six factors.

This group of interns is excited about the possibilities for construc-
tive change through careers in urban government. They are anxious to
be part of this process. They reject the most cynical stereotypes of local
officials and are, if anything, overoptimistic in estimating the potential 
impact of government on the city. They are also democratic rather 


than elitist in orientation and (based on the rankings of statements not 


A varimax orthogonal rotation was employed, with a minimum eigenvalue of 1.8
used as the standard for retention of factors prior to rotation.

Persons classified as significantly loaded are those located above +.35, significant
[illegible]-Lacey expression for the standard error of a zero correlation.

Persons classified as defining for a factor are those who meet the above criterion and
have all other factor loadings at least .15 lower than their highest loading. Defining
persons are those whose Q-sorts are included when computing the weighted average
typical of a factor.
TABLE 2
Extreme Statements Typical of Factor One with Z-Scores*

Statement                                                        Z-Score

I think that through a career in local government I could
do a great deal to help my community.                              +1.56

Local government's impact on the design of the city and
the solution of its problems is probably greater than
it has ever been thanks to improved planning and a huge
influx of new funds.                                               +1.54

I think I would enjoy some kind of career in
urban administration.                                              +1.52

If we are to have good government, we must not let
power slip from the hands of the people.                           +1.37
------------------------------------------------------------------------
There is so much easy money in government today that
I am afraid a lot of local officials are on the take.              -1.57

Local government is only one of the many forces acting
on the city and probably one of the least powerful in
shaping the urban environment.                                     -1.59

The role of the elected local official consists
largely of doing favors for one's friends and relatives.           -1.83

I doubt that I will ever know enough to play a major
role in the public affairs of my community.                        -1.92

* The four statements above the line are those receiving strongest agreement from this group; the four below the
line are those receiving strongest disagreement.


shown in Table 2) perceive local government to be run basically on the 
democratic model. 

Figure 1 shows the shifts in student loadings on this factor. The plus
sign at the right of the diagram indicates the positive loading of one
instructor on this factor. The two students moving onto this factor during
the program report having highly positive internship experiences.
One of these students received a permanent job with his agency; the 
second volunteered to continue participating in the seminar portion of 
the program to help orient incoming interns. The clear downward 
movement of five interns relative to this factor may indicate somewhat 
dampened enthusiasm and/or growing realism about the impact in- 
dividuals and governments can have on the urban environment. 

As previously suggested, there is not a great deal of movement
relative to the factor structure, particularly with respect to factors two


In this instance, the three instructors show distinctly different orientations. Instructor
A is a hybrid of factors 1 and 2; instructor B is on factor 2; and instructor C is
marginal to factor 4. This lack of consensus no doubt contributes to the variety of meanings
that can be extracted from the internship experience and possibly weakens its effect
generally.


774 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Figure 1. Movement of Students Relative to Factor One. (Horizontal axis: BEFORE, AFTER.)
Shows Before-After comparisons for those loaded before or after at +.35 or above.
Instructor loading is represented by cross at right.


through six. Perhaps there are clues to the stability of this structure in
the apparent focus of each factor. Factor two can be characterized,
superficially at least, 


three and five Suggest 
sider perspective on 
characterize in an о, 
as the viewpoint of 
behind the city and its 
main to be rooted in s i 
агу socialization, If i 
social character of a 
character are, in fact, imbedded in a structure of belief that has been
forged during primary socialization and is still being reinforced by 
significant others, this would explain why even the most intense 
academic experiences would in many instances have little impact on 
these specific perceptions (Berger and Luckmann, 1966).* 

The full elaboration of this line of thought deserves far more space 
than is available here. This very general review of substantive findings 
is intended rather to illustrate the suggestive power of Q factor 
analysis as a device for program evaluation. In this case, Q analysis 
obviously raises fundamental doubts about the direction and intensity 
of the urban internship's short-run impact on cognitions. 


Possible Applications of Q Factor Analysis to 
Educational Program Evaluation and Testing 


The following list of possible applications of Q factor analysis to 
educational testing and program evaluation is not meant to be ex- 
haustive and will no doubt suggest other uses. Q factor analysis may 
be used: 


1. To estimate at entry the probability an individual will respond
appropriately to a particular treatment or program. A gradual buildup
of experience with various "factors" or "types" would
naturally lead to more individualization of curricula or treatments.
In this case, Q would be a supplement to the judgments of
sensitive and experienced professionals. Use of Q factor analysis
for initial screening would also provide a basis for exclusion of
those already possessing the attitude configuration at which it is
aimed. (See also use 4.)

2. For continuous monitoring of short-term movements relative to a
factor structure. This may be either for the purpose of evaluating
the impact of program components, e.g., a teaching module or
field trip, or to determine the length or type of program needed
by an individual. The factor structure may be one determined by
the current student or client population or it may be defined by a
representative group of previous students plus the individual under
examination or it may be defined primarily by a group
selected for some other purpose. (See uses 5 and 6.)

* Berger and Luckmann suggest that all secondary socialization processes must build
on or seek to overcome the subjective reality constructed during primary socialization.
To alter an established structure of belief those engaged in secondary socialization must
intensify the affective charge of the process through establishment of intimacy and
identification or by forging a psychological link between the new reality being presented and
the "home" reality produced by primary socialization (Berger and Luckmann, 1966, pp.
[illegible]-135; Mead, 1934).

The efficacy of such a program may, of course, be measured in other terms, such as
the numbers of students choosing urban careers. It is also possible that major changes in
attitude structure have been introduced that will not be manifest until later. Certainly,
follow-up testing is called for if the intention is to evaluate the full impact of such a
program. The use of control groups will be essential for some testing purposes.

3. To specify conflicts of viewpoint among instructors or other
program personnel for the purpose of assessing the impact of 
these conflicts on students or clients, to resolve such conflicts, or 
simply to clarify the nature of such conflicts so they can be taken 
into account or called to the attention of students or clients. 

4. To identify attitude types or factors that are frequently products of 
primary socialization and so are resistant to change and to identify 
those which are ordinarily less stable. (See the discussion of the 
urban internship program factors in the preceding section.) 

5. To evaluate student performances in order to make normative judgments
regarding changes in cognitions. Norms need not be established
prior to testing but instead identified through factor
analysis. For instance, if there is a professional consensus on the
desirable pattern of thinking then this can be identified by including
a set of professionals in the population of sorters. In the
absence of such a consensus, of course, no normative judgments
are justified.

6. To identify how the presence of various attitude “types” in a class 
or treatment group affects the individual's response to the program. 
Class peers often play a major role in mediating program effects. 
By systematically varying the presence and proportions in the 
class of various attitude "types" it may be possible to learn in a 
general way the contribution this mediation makes to the 
program's impact.



Conclusion 


The purpose of this discussion has been to present the strengths of Q
factor analysis as a technique for use in educational and other
programs characterized by complex symbolic transactions. Its general
advantage seems to be that it avoids the rigid imposition of a set of
categories established prior to testing. The rationale for avoiding such
a priori constructions is given by the complexity of both aims and
structure that is typical of such programs. Such complexity is itself a
reflection of individuals' variety and complexity and the consequent
need for programs that will approach one set of objectives for certain
individuals and quite another set for others.

Educational testing is often employed as a device for grading. While it is not the
purpose of this essay to challenge that use, Q factor analysis does raise a conceptual
[illegible] as to the relevance of [illegible] systems of grading student
[illegible] educational programs. Such systems focus on changes in the substance
or quality (e.g., complexity) of students' cognitive orientations. If the symbolic
[illegible] and ambiguous, if the meanings taken from the experience are
highly variable and perhaps unique to each individual, then the problems of measuring
and modelling the impact of the program are considerable. The appropriateness of
attempting to apply [illegible] standards of performance to the clients of such programs
should be challenged.

The post-industrial future, according to one view, will be 
characterized by a growing volume of human services in the model of 
those addressed here, i.e., two-way or multi-sided symbolic interac- 
tions where clients become participants who help define the nature of 
their problems and help work out the solutions (White and Gates, 
1974). 

If we are to have hope of properly assessing the impacts of such ex- 
changes, then it will be necessary to develop monitoring techniques 
that avoid stereotyping participants, are sensitive to movements in un- 
anticipated directions, and enrich the possibilities for discovery of a 
range of participant responses that cannot ordinarily be anticipated in 
the design of traditional impact measures. Q factor analysis is one such 
technique. 
THE CALCULATION OF RELIABILITY FROM A
SPLIT-PLOT FACTORIAL DESIGN 


ROBERT L. BRENNAN 
State University of New York at Stony Brook 


This paper treats the question, "How should one estimate the 
reliability of school (or classroom) means when persons are nested 
within schools (or classrooms)?" We begin by reviewing the use of 
variance components in the estimation of reliability from a randomized
block (RB) design. Then we extend this rationale to the estimation
of reliability (or generalizability) coefficients in a split-plot
factorial (SPF) design with persons nested within schools.

Through the use of variance components from the SPF design, we 
derive estimates of reliability for schools and for persons within 
schools. Then we compare the reliability for persons within schools 
from a SPF design with the reliability for persons from a RB design. 
Finally, we compare the reliability for schools from a SPF design 
with the reliability for school means from a RB design. 


PERHAPS the most widely used formulas for measuring reliability 
when a test is administered once to a group of persons are those that 
can be viewed as providing the average of all possible split-half 
reliability coefficients for the test. Cronbach (1951) has shown that his
coefficient α, Kuder and Richardson's (1937) Formula-20, and Hoyt's
(1941) reliability coefficient can all be interpreted in this manner, and
are, therefore, coefficients of equivalence. In order to calculate any of
these coefficients one makes use of sample statistics from a persons by
items data matrix; moreover, in calculating Hoyt's coefficient one
analyzes the data matrix in the framework of a randomized block fac- 
torial analysis of variance design. 

In this paper, Hoyt's idea of calculating reliability from a randomized
block design is extended to a split-plot factorial design. In its
simplest form this design allows for the incorporation of an added
dimension into reliability analyses, namely the nesting of persons in
some larger unit, such as schools or classrooms. In the opinion of this
author, the experimental model used to collect data for most reliability 
analyses is usually one in which persons are nested within some dimen- 
sion; therefore, the split-plot design would appear to be more ap- 
propriate than a simple randomized block design. In addition, the 
split-plot design can be used to provide a basis for estimating the 
reliability of scores for the units (e.g., schools) within which persons 
are nested. Such reliability estimates are often needed for statistical 
analyses in which the unit of analysis is the school or classroom, rather 
than the person. 

In order to introduce the method of calculating reliability from a 
split-plot design, we begin with a brief description of the use of 
variance components in the calculation of reliability from a ran- 
domized block design. The reader unfamiliar with this technique 
might consult Lindquist (1953) or Cronbach, Gleser, Nanda, and Ra- 
jaratnam (1972) for more detail. 


Reliability Using a Randomized Block Design 


A randomized block design is a repeated measures design in which
the linear model is:

$$X_{ij} = M + P_i + I_j + PI_{ij} + E^{*}_{(ij)} \tag{1}$$

where

$X_{ij}$ = response of person i (i = 1, 2, ..., n) to item j (j = 1, 2, ..., k),
$M$ = population grand mean,
$P_i$ = effect for person i in the population,
$I_j$ = effect for item j in the population, and
$PI_{ij}$ = interaction in the population of person i with item j, which
is confounded with
$E^{*}_{(ij)}$ = experimental error.


Table 1 provides computational formulas and expected values of the
mean squares for this design.¹

TABLE 1
Randomized Block ANOVA Table

Source               S.S.                     d.f.             M.S.      E(M.S.)*
Between Persons      (P) - (C)                n - 1            MS(P)     $\sigma_{E^*}^2 + \sigma_{PI}^2 + k\sigma_P^2$
Within Persons       (PI) - (P)               n(k - 1)         MS(w.P)
  Items              (I) - (C)                k - 1            MS(I)     $\sigma_{E^*}^2 + \sigma_{PI}^2 + n\sigma_I^2$
  Persons by Items   (PI) - (P) - (I) + (C)   (n - 1)(k - 1)   MS(PI)    $\sigma_{E^*}^2 + \sigma_{PI}^2$
Total                (PI) - (C)               nk - 1

$(PI) = \sum_i \sum_j X_{ij}^2$,   $(P) = \frac{1}{k}\sum_i \bigl(\sum_j X_{ij}\bigr)^2$,
$(I) = \frac{1}{n}\sum_j \bigl(\sum_i X_{ij}\bigr)^2$,   $(C) = \frac{1}{nk}\bigl(\sum_i \sum_j X_{ij}\bigr)^2$

* The indicated expected values are for the random effects model. Since experimental error (E*) is confounded with the persons by
items interaction term (PI), we will let $\sigma_E^2 = \sigma_{E^*}^2 + \sigma_{PI}^2$.

¹ Millman and Glass [year illegible] and Kirk (1968) provide rules that greatly simplify the
determination of expected values of mean squares.

Reliability is, in general, defined as the ratio of the variance of true
scores to the variance of observed scores. From (1) it is clear that the
variance of the observed scores for item j is given by:

$$\sigma_{X_j}^2 = \sigma_P^2 + \sigma_{E_j}^2 \tag{2}$$

where

$\sigma_P^2$ = true score variance (i.e., the population variance for
persons), and
$\sigma_{E_j}^2$ = error variance for item j.

Since $\sigma_{E_j}^2 = \sigma_E^2$ for all k items, the reliability of a one-item test is given
by the intraclass correlation coefficient

$$\rho_1 = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_E^2} \tag{3}$$

Now, using Table 1 we find that $\sigma_P^2$ is estimated by [MS(P) -
MS(PI)]/k and $\sigma_E^2$ is estimated by MS(PI). Thus,

$$r_1 = \frac{[\mathrm{MS}(P) - \mathrm{MS}(PI)]/k}{[\mathrm{MS}(P) - \mathrm{MS}(PI)]/k + \mathrm{MS}(PI)} \tag{4}$$

$$\phantom{r_1} = \frac{\mathrm{MS}(P) - \mathrm{MS}(PI)}{\mathrm{MS}(P) + (k - 1)\,\mathrm{MS}(PI)} \tag{5}$$

In order to determine the reliability of a test of length k', one can use
the Spearman-Brown formula in conjunction with (3) or (5), above.
Thus,

$$\rho_{k'} = \frac{k'\rho_1}{1 + (k' - 1)\rho_1} \tag{6}$$

$$\phantom{\rho_{k'}} = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_E^2/k'} \tag{7}$$

$$r_{k'} = \frac{\mathrm{MS}(P) - \mathrm{MS}(PI)}{\mathrm{MS}(P) + \left(\frac{k - k'}{k'}\right)\mathrm{MS}(PI)} \tag{8}$$

One can also determine the reliability of a test of length k' by making
a direct application of the variance components from the randomized
block design. From (1) we find that

$$\sigma^2\!\left[\frac{1}{k'}\sum_{j=1}^{k'} X_{ij}\right] = \sigma^2(P_i) + \sigma^2\!\left[\frac{1}{k'}\sum_{j=1}^{k'} E_{ij}\right] = \sigma_P^2 + \sigma_E^2/k'; \tag{9}$$

i.e., the variance of the observed score means² over k' items equals the
variance of the true scores plus the variance due to errors of measurement.
Thus,

$$\rho_{k'} = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_E^2/k'} \tag{10}$$

which is identical to (7) and, therefore, can be estimated by (8). Note
that the error variance in (9) conforms with the usual formula for the
variance of a mean. This use of variance components in conjunction
with the basic notion of reliability has been extended by Cronbach et
al. (1963, 1972) into the Theory of Generalizability. In this theory, $\sigma_P^2$
in (10) is called the universe score variance, (9) is called the expected
value of the observed score variance, and (10) is called the coefficient
of generalizability.


² Here, and elsewhere in this paper, we speak about the reliability of mean scores. The
reader should recall that the reliability of mean scores is identical to the reliability of
total scores.

When k = k' in (6), (7), (8), or (10), then we get what is often called
the reliability of a test of full length:

$$r_k = \frac{\mathrm{MS}(P) - \mathrm{MS}(PI)}{\mathrm{MS}(P)} \tag{11}$$

Formula (11) is best known as Hoyt's (1941) reliability formula,
which is algebraically equivalent to Cronbach's (1951) coefficient α.
When all items are scored in a correct-wrong (1-0) manner, then Formula
(11) is also equivalent to Kuder and Richardson's (1937)
Formula-20.
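
The relations among formulas (5), (8), and (11) can be checked numerically. The following is a minimal sketch in Python, not part of the original paper; the function names and the small table of 1-0 scores are hypothetical and serve only as an illustration. It computes MS(P) and MS(PI) from a persons by items table, applies formula (8), and compares the full-length result with coefficient α computed from item and total-score variances.

```python
def rb_mean_squares(x):
    """Return MS(P) and MS(PI) for an n-persons by k-items score table."""
    n, k = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n * k)
    person_means = [sum(row) / k for row in x]
    item_means = [sum(row[j] for row in x) / n for j in range(k)]
    ss_p = k * sum((m - grand) ** 2 for m in person_means)
    ss_i = n * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((v - grand) ** 2 for row in x for v in row)
    ss_pi = ss_total - ss_p - ss_i  # persons by items residual
    return ss_p / (n - 1), ss_pi / ((n - 1) * (k - 1))


def reliability(x, k_prime=None):
    """Formula (8); with k_prime equal to k this is Hoyt's formula (11)."""
    k = len(x[0])
    k_prime = k if k_prime is None else k_prime
    ms_p, ms_pi = rb_mean_squares(x)
    return (ms_p - ms_pi) / (ms_p + (k - k_prime) / k_prime * ms_pi)


def coefficient_alpha(x):
    """Coefficient alpha from item variances and total-score variance."""
    k = len(x[0])

    def var(values):
        m = sum(values) / len(values)
        return sum((v - m) ** 2 for v in values) / (len(values) - 1)

    item_vars = sum(var([row[j] for row in x]) for j in range(k))
    total_var = var([sum(row) for row in x])
    return k / (k - 1) * (1 - item_vars / total_var)


# Hypothetical 1-0 item scores: 6 persons (rows) by 4 items (columns).
scores = [
    [1, 1, 1, 0],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
]

r_one_item = reliability(scores, k_prime=1)  # formula (5)
r_full = reliability(scores)                 # formula (11)
print(r_one_item, r_full, coefficient_alpha(scores))
```

Stepping the one-item value up with the Spearman-Brown formula should reproduce the full-length value, which is the sense in which (6) and (8) agree.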


Reliability Using a Split-Plot Factorial Design 


The above formulas for reliability using a randomized block are
well-documented in the literature (see, for example, Stanley, 1971).
Here we extend the ideas of the previous section to a split-plot factorial
design in which persons are nested within some higher order
dimension (which, here, we will call "schools"), persons are crossed
with items, and schools are crossed with items. Thus,

$$X_{mij} = M + S_m + P_{i(m)} + I_j + SI_{mj} + IP_{ji(m)} + E^{*}_{(mij)} \tag{12}$$

where

$X_{mij}$ = response of person i (i = 1, 2, ..., n) nested in school m
(m = 1, 2, ..., r) to item j (j = 1, 2, ..., k),
$M$ = population grand mean,
$S_m$ = effect for school m in the population,
$P_{i(m)}$ = effect for person i, nested within school m, in the population,
$I_j$ = effect for item j in the population,
$SI_{mj}$ = interaction in the population of school m and item j, and
$IP_{ji(m)}$ = interaction in the population of item j and person i, nested
within school m, which is confounded with
$E^{*}_{(mij)}$ = experimental error.


Note that, in order to simplify the exposition, we assume here that
there are an equal number (n) of persons nested within each school.
Under these circumstances, computational formulas for the analysis of
variance in the split-plot design are given in Table 2 with mean squares
and expected values of the mean squares given in Table 3.³

TABLE 2
Computational Formulas for Split-Plot ANOVA

Source                                   Sums of Squares                        Degrees of Freedom
Between Subjects                         (SP) - (C)                             rn - 1
  Schools                                (S) - (C)                              r - 1
  Persons within Schools                 (SP) - (S)                             r(n - 1)
    Persons within School m              (SP_m) - (S_m)                         n - 1
Within Subjects                          (SPI) - (SP)                           rn(k - 1)
  Items                                  (I) - (C)                              k - 1
  Schools by Items                       (SI) - (S) - (I) + (C)                 (r - 1)(k - 1)
  Items by Persons within Schools        (SPI) - (SP) - (SI) + (S)              r(n - 1)(k - 1)
    Items by Persons within School m     (SPI_m) - (SP_m) - (SI_m) + (S_m)      (n - 1)(k - 1)
Total                                    (SPI) - (C)                            rnk - 1

$(SPI_m) = \sum_i \sum_j X_{mij}^2$,   $(SP_m) = \frac{1}{k}\sum_i \bigl(\sum_j X_{mij}\bigr)^2$,
$(SI_m) = \frac{1}{n}\sum_j \bigl(\sum_i X_{mij}\bigr)^2$,   $(S_m) = \frac{1}{nk}\bigl(\sum_i \sum_j X_{mij}\bigr)^2$,
$(SPI) = \sum_m (SPI_m)$,   $(SP) = \sum_m (SP_m)$,   $(SI) = \sum_m (SI_m)$,   $(S) = \sum_m (S_m)$,
$(I) = \frac{1}{rn}\sum_j \bigl(\sum_m \sum_i X_{mij}\bigr)^2$,   $(C) = \frac{1}{rnk}\bigl(\sum_m \sum_i \sum_j X_{mij}\bigr)^2$

TABLE 3
Mean Squares and Expected Values of Mean Squares for Split-Plot ANOVA

Source                                   Mean Squares (MS)     E(MS)*
Between Subjects                         MS(b.Subjs.)
  Schools                                MS(S)                 $\sigma_{E^*}^2 + \sigma_{IP}^2 + n\sigma_{SI}^2 + k\sigma_P^2 + nk\sigma_S^2$
  Persons within Schools                 MS(Pw.S)              $\sigma_{E^*}^2 + \sigma_{IP}^2 + k\sigma_P^2$
    Persons within School m              MS(Pw.S_m)
Within Subjects                          MS(w.Subjs.)
  Items                                  MS(I)                 $\sigma_{E^*}^2 + \sigma_{IP}^2 + n\sigma_{SI}^2 + nr\sigma_I^2$
  Schools by Items                       MS(S by I)            $\sigma_{E^*}^2 + \sigma_{IP}^2 + n\sigma_{SI}^2$
  Items by Persons within Schools        MS(I by Pw.S)         $\sigma_{E^*}^2 + \sigma_{IP}^2$
    Items by Persons within School m     MS(I by Pw.S_m)

* These expected values are for the random effects model. If schools are fixed and items and persons are random, the
expected values are the same with the exception of the expected value of MS(I), which becomes $\sigma_{E^*}^2 + \sigma_{IP}^2 + nr\sigma_I^2$. Since experimental
error is confounded with the items by persons within schools interaction term, we will let $\sigma_E^2 = \sigma_{E^*}^2 + \sigma_{IP}^2$.

Using the information in these tables we will express two different
reliability coefficients for the same test and compare these coefficients
with similar coefficients obtained from a randomized block design. We
will not, however, go through the steps involved in deriving the coefficients
that emanate from the split-plot design. The derivations proceed
in a straightforward manner by applying the method of variance components
in conjunction with the basic definition of reliability (or
generalizability).

³ If the number of persons nested within each school is not the same for all schools,
then one must obtain adjusted mean squares. If persons were lost for reasons unrelated
to the conduct of the experiment, then an unweighted means solution should be used. If
the experiment deliberately employed unequal n's, then a least squares solution should
be used. See Kirk (1968, pp. 204-208, 276-281) for a discussion of these analytic
techniques.


Reliability for Persons 


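The reliability of a test of length k' for persons within schools is
given by:

$$\rho_{Pw.S} = \frac{\sigma_P^2}{\sigma_P^2 + \sigma_E^2/k'} \tag{13}$$

which is estimated by:

$$r_{Pw.S} = \frac{\mathrm{MS}(Pw.S) - \mathrm{MS}(I\ \mathrm{by}\ Pw.S)}{\mathrm{MS}(Pw.S) + \left(\frac{k - k'}{k'}\right)\mathrm{MS}(I\ \mathrm{by}\ Pw.S)} \tag{14}$$

It is interesting to compare (13) and (14) with the reliability that
would be obtained if all rn persons were placed in a randomized block
design, thus disregarding the school dimension. Under these circumstances,
the resulting reliability, expressed in terms of variance components
from the split-plot design, is given by:

$$\rho_{RB} = \frac{\sigma_P^2 + \frac{n(r-1)}{rn-1}\,\sigma_S^2}{\sigma_P^2 + \frac{n(r-1)}{rn-1}\,\sigma_S^2 + \frac{1}{k'}\left[\sigma_E^2 + \frac{n(r-1)}{rn-1}\,\sigma_{SI}^2\right]} \tag{15}$$

which is estimated by:

$$r_{RB} = \frac{\mathrm{MS}(b.Subjs) - \frac{1}{rn-1}\left[rn\,\mathrm{MS}(w.Subjs) - \mathrm{MS}(I)\right]}{\mathrm{MS}(b.Subjs) + \frac{k - k'}{k'(rn-1)}\left[rn\,\mathrm{MS}(w.Subjs) - \mathrm{MS}(I)\right]} \tag{16}$$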


Algebraic manipulation of (13) and (15) reveals that (13) is less
than, equal to, or greater than (15) depending upon whether $\sigma_P^2/\sigma_E^2$ is
less than, equal to, or greater than $\sigma_S^2/\sigma_{SI}^2$, respectively. Thus, if one
uses a randomized block design to calculate reliability for persons
when, in fact, persons are nested within some dimension, such as
schools or classrooms, the resulting coefficient will be biased, and,
moreover, the direction of bias will be unknown. Since, in most
reliability studies, persons are sampled in a stratified random manner,
rather than simply randomly, it would appear that Formulas (13) and
(14) are preferable to the reliability formulas emanating from a randomized
block design.
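
The comparison between (13) and (15) can be verified in a few lines. The sketch below is not taken from the paper: it writes c for the constant n(r - 1)/(rn - 1), a shorthand introduced here, and assumes that (15) has the form of (13) with $\sigma_P^2$ replaced by $\sigma_P^2 + c\sigma_S^2$ and $\sigma_E^2$ replaced by $\sigma_E^2 + c\sigma_{SI}^2$.

```latex
\begin{align*}
\rho_{(13)} &= \frac{a}{a + b/k'}, & a &= \sigma_P^2, & b &= \sigma_E^2, \\
\rho_{(15)} &= \frac{a'}{a' + b'/k'}, & a' &= \sigma_P^2 + c\,\sigma_S^2, & b' &= \sigma_E^2 + c\,\sigma_{SI}^2 .
\end{align*}
Since $a/(a + b/k') = 1/\bigl(1 + (b/a)/k'\bigr)$ increases with $a/b$,
\begin{align*}
\rho_{(13)} < \rho_{(15)}
 &\iff \frac{a}{b} < \frac{a'}{b'}
 \iff \sigma_P^2\,(\sigma_E^2 + c\,\sigma_{SI}^2) < \sigma_E^2\,(\sigma_P^2 + c\,\sigma_S^2) \\
 &\iff c\,\sigma_P^2\,\sigma_{SI}^2 < c\,\sigma_E^2\,\sigma_S^2
 \iff \frac{\sigma_P^2}{\sigma_E^2} < \frac{\sigma_S^2}{\sigma_{SI}^2}
 \qquad (c > 0).
\end{align*}
```

The same steps with the inequality reversed, or replaced by equality, give the other two cases, and the constant c drops out of the condition.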

Using the split-plot design, one can also determine the reliability for
persons within any school m. This is given by:

$$r_{Pw.S_m} = \frac{\mathrm{MS}(Pw.S_m) - \mathrm{MS}(I\ \mathrm{by}\ Pw.S_m)}{\mathrm{MS}(Pw.S_m) + \left(\frac{k - k'}{k'}\right)\mathrm{MS}(I\ \mathrm{by}\ Pw.S_m)} \tag{17}$$

which is algebraically equivalent to the coefficient one would get if the
n persons in school m were analyzed alone in a randomized block
design.



Reliability for Schools 


In thinking about the reliability of a test, many of us are so ac- 
customed to thinking about reliability for persons that we overlook 
the fact that reliability is a generic concept that is applicable to any
unit of analysis. Often, in educational and psychological research, the 
intended unit of analysis is not the person. For example, it often occurs 
in large-scale observational studies that tests are administered to 
persons within a number of schools, and the unit of analysis for sta- 
tistical purposes is the school, not the person. Thus, if reliability is 
to be taken into account in such analyses, one needs the reliability 
of school means, which can be obtained quite easily from the split-plot 
design. 

Using the method of variance components, one finds that the
reliability of a test of length k', for schools, is given by:

$$\rho_S = \frac{\sigma_S^2}{\sigma_S^2 + \frac{1}{n}\sigma_P^2 + \frac{1}{k'}\left(\sigma_{SI}^2 + \frac{1}{n}\sigma_E^2\right)} \tag{18}$$

In general, Formula (18) cannot be expressed very succinctly in terms
of mean squares from the split-plot design; however, if k' = k, then
(18) can be expressed as:

$$r_S = \frac{\mathrm{MS}(S) - [\mathrm{MS}(Pw.S) + \mathrm{MS}(S\ \mathrm{by}\ I) - \mathrm{MS}(I\ \mathrm{by}\ Pw.S)]}{\mathrm{MS}(S)} \tag{19}$$


It should be noted that, in this case, one must use the method of
variance components in order to determine reliability because the
Spearman-Brown Formula cannot take into account the fact that
$\sigma_P^2/n$ is part of the variance due to errors of measurement for all
values of k'. This problem is encountered because we are calculating
reliability for schools from a design in which the dependent variable is
a score for a person nested within a school.
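
As an illustration of how (19) might be computed, the following Python sketch obtains the four split-plot mean squares it needs from a three-way array of scores and then applies the formula. It is not the author's program: the function names, the dictionary keys, and the small data set are hypothetical, and equal n per school is assumed, as in the text.

```python
def spf_mean_squares(x):
    """Mean squares for scores x[m][i][j]: school m, person i, item j."""
    r, n, k = len(x), len(x[0]), len(x[0][0])
    grand = sum(v for s in x for p in s for v in p) / (r * n * k)
    school = [sum(v for p in s for v in p) / (n * k) for s in x]
    person = [[sum(p) / k for p in s] for s in x]
    school_item = [[sum(p[j] for p in s) / n for j in range(k)] for s in x]
    item = [sum(school_item[m][j] for m in range(r)) / r for j in range(k)]

    ss_s = n * k * sum((a - grand) ** 2 for a in school)
    ss_pws = k * sum((person[m][i] - school[m]) ** 2
                     for m in range(r) for i in range(n))
    ss_si = n * sum((school_item[m][j] - school[m] - item[j] + grand) ** 2
                    for m in range(r) for j in range(k))
    ss_ipws = sum((x[m][i][j] - person[m][i] - school_item[m][j] + school[m]) ** 2
                  for m in range(r) for i in range(n) for j in range(k))
    return {
        "S": ss_s / (r - 1),
        "PwS": ss_pws / (r * (n - 1)),
        "SI": ss_si / ((r - 1) * (k - 1)),
        "IPwS": ss_ipws / (r * (n - 1) * (k - 1)),
    }


def school_reliability(ms):
    """Formula (19): reliability of school means for a test of full length k."""
    return (ms["S"] - (ms["PwS"] + ms["SI"] - ms["IPwS"])) / ms["S"]


# Hypothetical 1-0 scores: 3 schools, 3 persons per school, 4 items.
data = [
    [[1, 1, 1, 0], [1, 1, 0, 0], [1, 1, 1, 1]],
    [[0, 0, 1, 0], [1, 0, 0, 0], [0, 0, 0, 0]],
    [[1, 0, 1, 1], [0, 1, 0, 0], [1, 1, 1, 0]],
]

ms = spf_mean_squares(data)
print(ms, school_reliability(ms))
```

The sums of squares are computed from deviations rather than from the bracketed terms of Table 2; the two routes give the same mean squares, and the deviation form is used here only because it is easier to read.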

Another way to approach the problem of calculating reliability for
schools would be to use a randomized block design in which the
dependent variable is an item mean over persons (i.e., item difficulty
level) within a school. In this case, the reliability for schools, expressed
in terms of variance components from the split-plot design, is given by:

$$\rho_{S(RB)} = \frac{\sigma_S^2 + \frac{1}{n}\sigma_P^2}{\sigma_S^2 + \frac{1}{n}\sigma_P^2 + \frac{1}{k'}\left(\sigma_{SI}^2 + \frac{1}{n}\sigma_E^2\right)} \tag{20}$$

which is estimated by:

$$r_{S(RB)} = \frac{\mathrm{MS}(S) - \mathrm{MS}(S\ \mathrm{by}\ I)}{\mathrm{MS}(S) + \left(\frac{k - k'}{k'}\right)\mathrm{MS}(S\ \mathrm{by}\ I)} \tag{21}$$

The only difference between (20) and (18) is in the addition of the term
$\sigma_P^2/n$ in the numerator of (20). Thus, (20) is an upwardly biased estimate
of (18). In effect, (20) includes a function of the variance due to
persons within schools in what should be the true (or universe) score
variance due to schools alone. However, if $\sigma_P^2$ is close to zero or n is
quite large, the discrepancy between (20) and (18) will be correspondingly
small.
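
The size of this discrepancy is easy to explore numerically. The sketch below is an illustration only: the function names are introduced here and the variance components are hypothetical values, not estimates from any study. It evaluates (18) and (20) for the same components and several values of n.

```python
def school_reliability_18(var_s, var_p, var_si, var_e, n, k_prime):
    """Formula (18): persons-within-schools variance counts as error."""
    error = var_p / n + (var_si + var_e / n) / k_prime
    return var_s / (var_s + error)


def school_reliability_20(var_s, var_p, var_si, var_e, n, k_prime):
    """Formula (20): var_p / n is absorbed into the universe score variance."""
    universe = var_s + var_p / n
    return universe / (universe + (var_si + var_e / n) / k_prime)


# Hypothetical variance components for schools, persons within schools,
# schools by items, and error; k_prime is the test length.
components = dict(var_s=0.04, var_p=0.10, var_si=0.01, var_e=0.20)

for n in (5, 25, 100):
    rho_18 = school_reliability_18(n=n, k_prime=20, **components)
    rho_20 = school_reliability_20(n=n, k_prime=20, **components)
    print(n, round(rho_18, 3), round(rho_20, 3), round(rho_20 - rho_18, 3))
```

Because the two formulas share a denominator, the gap is simply $(\sigma_P^2/n)$ divided by that denominator, so it shrinks as n grows and vanishes when $\sigma_P^2 = 0$.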


Discussion 


The formulas in the previous section were derived under the as- 
sumption that all effects are random; however, these formulas are 
equally valid if persons and items represent random effects and schools 
represent fixed effects. Under these circumstances, as noted in the 
footnote to Table 3, the only E(MS) that is altered is the one for items, 
and MS(/) does not affect any of the previous formulas. 

Brennan and Kane (1975) provide formulas for estimating the
reliability of school means for mixed effects and fixed effects models.
For example, Formula (20) can also be viewed as the reliability of
school means when items are random and persons are fixed.

In this paper we have concentrated on a relatively straightforward
extension of Hoyt's randomized block analysis of variance technique
for calculating reliability; however, the split-plot design, in even its
simplest form, appears to be quite powerful, and it is probably
sophisticated enough for most small-scale reliability studies and possibly
for many large-scale ones. In addition, the split-plot design is
conceptually simple, and variance components and reliability (or
generalizability) coefficients are often quite easy to calculate, even
without a computer.

Clearly, the ideas presented here might be extended for more com- 
plex reliability studies, in which persons are nested within more than 
one higher order dimension and/or items are administered under
different sets of conditions. The reader interested in such extensions 
(especially the latter one) should consult Cronbach et al.'s (1972) ex- 
cellent monograph on the Theory of Generalizability, which provides 
a framework for considering very sophisticated reliability analyses. 
SOME COMMENTS CONCERNING THE USE OF
MONOTONIC TRANSFORMATIONS TO REMOVE
THE INTERACTION IN TWO-FACTOR ANOVA'S


SCHUYLER W. HUCK AND CARY O. SUTTON
The University of Tennessee 


In his discussion of two-factor experiments, Winer (1971) points 
out that it may be desirable to remove the interaction (and thus ob- 
tain additivity of effects) through a monotonic data transformation. 
The present authors extend Lubin’s (1961) discussion of ordinal and 
disordinal interactions by introducing the concept of “dual-ordinal.”
This concept is important since a transformation cannot bring about
additivity of effects unless the interaction is “dual-ordinal” in nature.
For the applied researcher, a simple rule-of-thumb is set forth which 
allows one to determine, through visual inspection of a single interac- 
tion graph, whether or not an interaction is dual-ordinal. 


When graphed, a significant interaction from a two-factor analysis
of variance can take one of two forms. Using Lubin's (1961) terminology,
the interaction will either be “ordinal” or “disordinal.” If
the lines in the graph do not cross one another, the interaction is ordinal.
If, however, two or more of the lines cross, the interaction is disordinal.
Although his terms are used less frequently, Kerlinger (1964)
distinguishes between these two forms of interaction by means of the
adjectives “non-symmetric” and “symmetric,” with the former term
descriptive of a graph in which the lines do not cross.

In his discussion of factorial experiments, Winer (1971) explains
that it may be desirable to remove the interaction (and thus obtain
additivity of effects) through the use of a monotonic data transformation.
As Winer points out, however,


If the means for the levels of factor A have the same rank for all
levels of factor B, then a monotonic transformation can potentially
remove the A X B interaction. When such rank order is not present,
a monotonic transformation cannot remove the A X B interaction
(p. 399).


In other words, a transformation cannot be used to "get rid of" an in- 
teraction if that interaction is disordinal. If the interaction is ordinal, 
however, a transformation may be able to eliminate it from both the 
data and the underlying model. Hence, the distinction between ordinal 
and disordinal interactions has a definite practical implication for the 
applied researcher. 

Unfortunately, whether or not an interaction is ordinal can depend 
upon the way the researcher constructs his graph. While the ordinate is 
always labelled with the dependent variable, the abscissa can be label- 
led with the levels of either factor A or factor B. If the interaction were 
to be graphed both ways (i.e., first with factor A on the abscissa, and 
then again with factor B on the abscissa), it is possible that both 
graphs might reveal an ordinal interaction or that both graphs might 
reveal a disordinal interaction. It is also possible, however, for one of 
the graphs to reveal an ordinal interaction while the other graph 
reveals a disordinal interaction. To verify this latter possibility, con- 
sider the following set of hypothetical cell means for a 2 X 2 design: 


[2 X 2 table of hypothetical cell means: rows, FACTOR A; columns, FACTOR B]


When the interaction is graphed with factor B on the abscissa, the two
lines do not cross. However, when re-graphed with factor A on the abscissa,
the two lines do, in fact, cross one another.

A data transformation can potentially remove the interaction only
when the interaction turns out to be ordinal when graphed both ways.
In other words, the interaction must be "dual-ordinal" in order for the
transformation to have a chance of bringing about a condition of ad-
ditivity. If either or both of the two graphs turn out to be disordinal, it
will be impossible for the transformation to remove the interaction.

Most researchers graph an interaction just once, with the abscissa
labelled with the factor that makes the most intuitive sense within the
context of the experiment. If this initial (and only) graph reveals a dis-
ordinal interaction, then the researcher will immediately know that his
interaction is not dual-ordinal. On the other hand, if this graph turns
out to be ordinal, the interaction may or may not be dual-ordinal. The
following rule-of-thumb will allow the researcher to determine which
is the case (without his having to construct a second graph):


If all of the lines in the first graph slope upward, or if they all slope 
downward, then the interaction, if re-graphed with the other fac- 
tor on the abscissa, would also be ordinal. Thus, the interaction is 
dual-ordinal. Conversely, if one or more of the lines in the initial 
graph have an upward slope while one or more of the other lines 
have a downward slope, then the interaction, if re-graphed the 
other way, would be disordinal. In this case, the interaction is ob- 
viously not dual-ordinal. 
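
The rule-of-thumb can be checked mechanically. The following is a minimal sketch for a 2 X 2 table of cell means, with hypothetical values and hypothetical function names (`is_ordinal`, `is_dual_ordinal`, `slopes_agree`). It assumes that "ordinal" means no pair of lines crosses or touches at any level of the abscissa factor; treating ties as non-ordinal is a choice made here, not something the text specifies.

```python
from itertools import combinations


def is_ordinal(lines):
    """True if no two lines cross: each pair keeps the same order at every point."""
    for a, b in combinations(lines, 2):
        diffs = [x - y for x, y in zip(a, b)]
        if not (all(d > 0 for d in diffs) or all(d < 0 for d in diffs)):
            return False
    return True


def transpose(cells):
    """Swap the roles of the two factors (rows become columns)."""
    return [list(col) for col in zip(*cells)]


def is_dual_ordinal(cells):
    """cells[i][j] is the mean for level i of A and level j of B.

    Rows are the plotted lines when B is on the abscissa;
    columns are the plotted lines when A is on the abscissa.
    """
    return is_ordinal(cells) and is_ordinal(transpose(cells))


def slopes_agree(cells):
    """Rule-of-thumb for a two-level abscissa: all lines rise, or all lines fall."""
    rises = [row[-1] - row[0] for row in cells]
    return all(r > 0 for r in rises) or all(r < 0 for r in rises)


# Hypothetical cell means (not the article's own example).
mixed_slopes = [[10, 20], [5, 2]]   # one line rises, the other falls
same_slopes = [[10, 20], [5, 8]]    # both lines rise, by different amounts

for cells in (mixed_slopes, same_slopes):
    print(is_ordinal(cells), is_dual_ordinal(cells), slopes_agree(cells))
```

In both tables the first graph (factor B on the abscissa) is ordinal, but only the table whose lines slope in the same direction is dual-ordinal, which is what the rule-of-thumb predicts.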


It should be noted that when an interaction is dual-ordinal in 
nature, a monotonic transformation may bring about a condition of 
additivity. Removal of the interaction, however, is not guaranteed. In 
some instances, the original interaction will vanish after the raw scores 
are subjected to a data transformation. In other instances, the interac- 
tion will continue to exist no matter what type of transformation is 
employed. For this reason, the rule-of-thumb presented above should 
prove to be more helpful in identifying interactions that cannot be 
eliminated via transformation than in identifying those that can be. In
a sense, therefore, the examination of a graphed interaction in light of
the "dual-ordinal" concept should be looked upon as a useful screen-
ing technique which will prevent the researcher from wasting time try-
ing to eliminate an interaction through a transformation when it is
simply impossible to do so.


COMPARING THE VARIANCES OF SEVERAL
TREATMENTS WITH A CONTROL


KENNETH J. LEVY 
State University of New York at Buffalo 


The Dunnett procedure for comparing several treatment means 
with a control is applied to the problem of comparing several treat- 
ment variances with the variance of a control. Appropriate critical 
values are specified and an example is provided. 


OFTENTIMES, the experimenter is interested in making inferences
about treatment variances instead of, or in addition to, inferences about
means. Levy (1975) has proposed three multiple range tests for vari-
ances which allow pairwise comparisons among k independent sample
variances; the philosophy underlying these tests with respect to the
probability of a type-I error is that advocated by Newman (1939)
and later by Keuls (1952). The third test suggested by Levy is based
upon a normalizing log transformation of the sample variances intro-
duced much earlier by Bartlett and Kendall (1946). If an experimenter
were specifically interested in making pairwise comparisons of the
variances of several treatments with a control, this same normalizing
log transformation could serve as a basis for such a procedure.

Dunnett (1955, 1964) addresses himself to the problem of compar-
ing a set of k - 1 treatment means with the mean of a control condition.
Each of the resulting k - 1 comparisons utilizes the same information
concerning the control condition; thus, the k - 1 comparisons are not in-
dependent. Rather than setting a level of significance equal to α for
each of the individual comparisons, Dunnett establishes a joint signifi-
cance level for the set of all k - 1 comparisons. In the present paper, the
same procedure is applied to the problem of comparing several in-
dependent treatment variances with the variance of an independent
control.


Theoretical Basis


Bartlett and Kendall (1946) investigated a normalizing log transfor-
mation of the sample variance $s^2$ when sampling from a $N(\mu, \sigma^2)$ pop-
ulation. They showed that $\log_e s^2$ is approximately normally dis-
tributed as $N(\log_e \sigma^2, 2/n)$, where n is the number of degrees of freedom
for $s^2$.

Let $s_0^2, s_1^2, s_2^2, \ldots, s_{k-1}^2$ be k independent sample variances, each
based upon random samples of size n + 1 drawn from k independent
normal populations, $N(\mu_i, \sigma_i^2)$, with unknown means and variances.
Let $l_0, l_1, l_2, \ldots, l_{k-1}$ be the log transformations of the given sample
variances; i.e.,

$$l_i = \log_e s_i^2.$$

Then, the $l_i$ are independently distributed approximately as

$$N(\mu_{l_i}, \sigma_l^2),$$

where $\mu_{l_i} = \log_e \sigma_i^2$ and $\sigma_l^2 = 2/n$.

Let us now consider the quantities $z_i$, where

$$z_i = \frac{(l_i - l_0) - (\mu_{l_i} - \mu_{l_0})}{2\sqrt{1/n}}.$$

As Dunnett (1955) points out, lower confidence limits with joint con-
fidence coefficient $1 - \alpha$ for the k - 1 comparisons $\mu_{l_i} - \mu_{l_0}$ will be
given by

$$(l_i - l_0) - d_i' \, 2\sqrt{1/n}, \quad (i = 1, 2, \ldots, k - 1),$$

if the k - 1 constants $d_i'$ are chosen such that

$$P(z_1 < d_1', z_2 < d_2', \ldots, z_{k-1} < d_{k-1}') = 1 - \alpha.$$

Similarly, upper confidence limits will be given by

$$(l_i - l_0) + d_i' \, 2\sqrt{1/n};$$

and two-sided confidence limits having the desired joint confidence
coefficient will be given by

$$(l_i - l_0) \pm d_i'' \, 2\sqrt{1/n}, \quad (i = 1, 2, \ldots, k - 1),$$

if the k - 1 constants $d_i''$ are chosen to satisfy

$$P(|z_1| < d_1'', |z_2| < d_2'', \ldots, |z_{k-1}| < d_{k-1}'') = 1 - \alpha.$$


To find any set of constants $d_i'$ or $d_i''$ satisfying these equations, the
joint distribution of the $z_i$ is required. This distribution is a mul-
tivariate normal distribution with means 0 and variances 1, where the
correlation between $z_i$ and $z_j$ is 1/2. Tabulations, for equal values of the
arguments, may be obtained from Dunnett's original one-sided t
tables and Dunnett's revised two-sided t tables with degrees of
freedom equal to $\infty$. Thus, for one-sided comparisons between k - 1
variances and a standard variance for a joint confidence coefficient of
.95 or .99, the critical values are:

          k - 1, number of variances (excluding the standard)
1 - α      1     2     3     4     5     6     7     8     9
 .95     1.64  1.92  2.06  2.16  2.23  2.29  2.34  2.38  2.42
 .99     2.33  2.56  2.68  2.77  2.84  2.89  2.93  2.97  3.00

For two-sided comparisons between k - 1 variances and a standard
variance for a joint confidence coefficient of .95 or .99, the critical
values are:

          k - 1, number of variances (excluding the standard)
1 - α      1     2     3     4     5     6     7     8     9
 .95     1.96  2.21  2.35  2.44  2.51  2.57  2.61  2.65  2.69
 .99     2.58  2.79  2.92  3.00  3.06  3.11  3.15  3.19  3.22


Example 


Suppose that a completely randomized experiment with three treat- 
ments and a control has been conducted. Suppose further, that the ex- 
periment employed 10 subjects per treatment and that the ex- 
perimenter is specifically interested in comparing the variances of each 
of the three treatments with the variance of the control; perhaps mean 
differences are unimportant or perhaps mean differences are not 
predicted at all. 

Suppose that the following results were obtained: 


               s²      log_e s²
Control      12.00      2.4849
Trt 1          .75      -.2877
Trt 2         3.00      1.0986
Trt 3         2.00       .6931

where $n_i$ = the number of subjects in the ith group, and $n_i - 1$ = degrees of
freedom for $s_i^2$.

Let us now test to see whether $\sigma_1^2$, $\sigma_2^2$, and $\sigma_3^2$ differ significantly
from $\sigma_0^2$; i.e., we wish to test the null hypothesis

$$H_0: \sigma_i^2 = \sigma_0^2$$

against the alternative

$$H_1: \sigma_i^2 \neq \sigma_0^2$$

for i = 1, 2, 3. When the null hypotheses are true,

$$z_1 = \frac{(-.2877) - (2.4849)}{2\sqrt{1/9}} = -4.1589,$$

$$z_2 = \frac{(1.0986) - (2.4849)}{2\sqrt{1/9}} = -2.0794,$$

$$z_3 = \frac{(.6931) - (2.4849)}{2\sqrt{1/9}} = -2.6877.$$

For an overall α level controlled at .05, one would reject the
hypotheses $H_0: \sigma_1^2 = \sigma_0^2$ and $H_0: \sigma_3^2 = \sigma_0^2$, since -4.1589 and -2.6877
are both less than -2.35.
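
The arithmetic of this example can be reproduced with a short script. This is a sketch under the assumptions stated in the text (equal degrees of freedom in every group, log variances approximately normal with variance 2/n); the function name `compare_variances_with_control` and its argument names are hypothetical. The critical value 2.35 is the two-sided .95 entry for three treatment variances.

```python
import math


def compare_variances_with_control(control_var, treatment_vars, df, critical):
    """Compare each treatment variance with the control variance on the log scale.

    Returns one tuple per treatment: (z, reject, (lower, upper)), where the
    limits are two-sided joint confidence limits for the difference between
    the log variance of the treatment and that of the control.
    """
    se = 2.0 * math.sqrt(1.0 / df)  # standard deviation of l_i - l_0
    l0 = math.log(control_var)
    results = []
    for var in treatment_vars:
        diff = math.log(var) - l0
        z = diff / se
        limits = (diff - critical * se, diff + critical * se)
        results.append((z, abs(z) > critical, limits))
    return results


# Sample variances from the example: 10 subjects per group, so 9 df each.
results = compare_variances_with_control(
    control_var=12.00,
    treatment_vars=[0.75, 3.00, 2.00],
    df=9,
    critical=2.35,
)

for label, (z, reject, limits) in zip(("Trt 1", "Trt 2", "Trt 3"), results):
    print(label, round(z, 4), reject, tuple(round(x, 4) for x in limits))
```

Because the script works from unrounded logarithms, its z values agree with those in the example to about three decimal places. Taking antilogs of the confidence limits gives joint limits for the ratio of each treatment variance to the control variance.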




NEGATIVE SIMILARITIES


ULF LUNDBERG AND BERNARD DEVINE 
University of Stockholm, Sweden 


Two experiments were carried out for the present investigation. 
The first experiment was an exact replication of an experiment by 
Ekman (1955), where the subjects had been requested to estimate 
positive similarity between pairs of emotional terms. The second ex- 
periment was carried out in the same way except that the subjects 
were also requested to give negative estimations when they con- 
sidered that a pair of words described feelings which were opposite to 
each other. Using factor analysis it was found that the negative es- 
timations obtained in the second experiment were represented as 
zero ratings in the first one. The second experiment also yielded some 
additional information which was considered to be psychologically 
meaningful. A reanalysis of Ekman’s data (1955) gave almost exactly 
the same result as the first experiment in the present study. 


IN the early 1960s Ekman developed a multidimensional scaling
model in which the similarity between two stimuli is directly
represented by the angle between and the length of hypothetical vec-
tors (Ekman, 1963; Ekman, Engen, Künnapas and Lindman, 1964),
these being used to represent the qualitative and quantitative
difference between the stimuli, respectively. For the case where all the
vectors were assumed to be of equal length the formula found to fit the
empirical data was

$$s_{ij} = \frac{\cos \phi_{ij}}{\cos (\phi_{ij}/2)}, \qquad (1)$$

where $s_{ij}$ represents the similarity estimate and $\phi_{ij}$ the angular separa-
tion between stimulus i and j. Later Ekman (1965) presented a refor-
mulation of the above equation solved for $\cos \phi_{ij}$,

$$\cos \phi_{ij} = (s_{ij}/4)\left(s_{ij} + \sqrt{8 + s_{ij}^2}\right), \qquad (2)$$

and he noted that when the similarities are given on a scale from 0 to 1,
the cosines derivable from the relation above differ from the actual
similarity estimates by 0.07 at the most and, for practical purposes, the
actual similarities may be used in the analysis, as had been the custom
before the model was developed (Ekman, 1954, 1955). However, the
model is only meaningful for positive values, i.e., when the angle
between the vectors is less than 90°. Therefore, when the model is ap-
plied the subjects are requested to give similarity estimations that are
either positive or zero. Micko (1970) has produced a modification of
Ekman's model which he terms a 'halo'-model. Negative scalar
products of the stimulus vectors are possible within this modification.
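
The relation between Equations (1) and (2) can be checked numerically. The sketch below reads Equation (1) as s = cos φ / cos(φ/2) and Equation (2) as cos φ = (s/4)(s + √(8 + s²)); the function names are hypothetical. It relies on the half-angle identity cos(φ/2) = √((1 + cos φ)/2), which holds for the angles between 0° and 90° that the model allows.

```python
import math


def similarity_from_cosine(c):
    """Equation (1), written in terms of c = cos(phi), for 0 <= c <= 1."""
    return c / math.sqrt((1.0 + c) / 2.0)


def cosine_from_similarity(s):
    """Equation (2): the cosine implied by a similarity estimate s, 0 <= s <= 1."""
    return (s / 4.0) * (s + math.sqrt(8.0 + s * s))


# Converting a similarity to a cosine and back should return the similarity.
for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    c = cosine_from_similarity(s)
    print(s, round(c, 4), round(similarity_from_cosine(c), 4))
```

The round trip confirms that Equation (2) is the positive root obtained when Equation (1) is squared and solved for the cosine.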
An alternative method of analysis was tried by Stone and Coles
(1970), who used a technique where they correlated the columns of a
similarity matrix and constructed a new matrix from the correlations.
The matrix obtained in this manner also includes negative values.
When they reanalyzed the data gained by Ekman (1965) and Kün-
napas (1966), they found that a factor analysis yielded a smaller
number of factors than was found in the original studies. However,
although all the new factors were bipolar while the original ones were
all unipolar, no new information appeared. A comparison between the
Micko and Stone-Coles approaches was made by Stone (1971).

Although the expression “negative similarities” may itself sound
contradictory, it does not follow that subjects would not be able to
produce meaningful results were they to be given the possibility of
using both positive and negative estimations. One area in which this
possibility does not seem too unrealistic is in the judgment of similarity
between emotional terms, where some words describe feelings which
may be perceived as positively related, e.g., “glad-happy,” while others
appear more or less opposite to each other, e.g., “glad-sad.” The psy-
chological meaningfulness of opposing stimuli of this kind has been
[...] Stone and Coles (1970) and by Yoshida, Kinase, Kurokawa and
Yashiro (1970). Recently a study by Tucker (1972) [...]

The present experiment was conducted in order to study the relation
between the results from negative similarity estimations and the results
gained from studies where only positive estimations had been used.


Method


The Pilot Study 


In 1955 Ekman conducted a similarity study of emotional terms. He 
analyzed the data using Thurstone's centroid method (1947) and 
found eleven factors. Fifteen of the 23 words used in his study were 
used in a pilot study for the present investigations, where the aim was 
to find out if a sufficient number of the words were consistently 
perceived as being of an opposite nature to each other. The 15 words 
were paired in all the possible combinations and presented to 17 psy- 
chology students. The subjects were instructed to put a plus sign if they 
considered that the pair described similar feelings and a minus sign for 
opposing feelings. 

It was found that 25 of the 105 pairs were perceived as opposites by 
every one of the subjects, whereas only five of the pairs were con- 
sistently perceived as similar. Consequently, it was considered possible 
to use these stimuli for studying negative similarities. 


The Main Experiments 


Ekman (1955) used 168 psychology students in his study. The sub-
jects were requested to make their estimations on a 5-point scale rang-
ing from 0 to 4, and to judge the qualitative similarity between the
words describing emotional states. Zero indicated “no similarity at
all” and four indicated “identity.” It was assumed that the emphasis
on the qualitative similarity would result in estimates of equally intense
emotions which could be represented by vectors of equal length in the
multidimensional model. Thus after a transformation to a scale from 0
to 1, the similarity estimates should not deviate very much from the
theoretical cosine values (see also Ekman, 1970).

Two main experiments were carried out for the present investiga-
tion. As far as possible Experiment I was an exact replication of
Ekman’s study from 1955, and 150 psychology students were used as
subjects. The purpose of this experiment was to yield up-to-date
results concerning the semantic meaning of the emotional terms. This
was considered necessary as some changes in the usage of language
may have occurred during this period. Yoshida, Kinase, Kurokawa
and Yashiro (1970) take up some aspects of this point. Another 150
students took part in Experiment II, and the procedure was essentially
the same as in Experiment I. The only difference was that the subjects
in the second study were requested to give similarity estimations on a
nine-point scale ranging from +4 to -4, where positive values in-
dicated the degree of positive similarity between the feelings described
by the words, negative values indicated the degree of opposite
relationship, and zero indicated that the words were in no way related
to each other (see also Table 2).

The English words presented by Ekman (1955) and referred to in the 
present investigation may in some cases deviate from the translations 
which we consider to give a better interpretation of the Swedish words. 
For these cases we have put Ekman's translations within parentheses.
One Swedish word (saknad) seemed to require the two English words 
"miss" and “lack” to clarify its meaning. 


Results 


The arithmetic means of the estimations were calculated for each
pair of words in both experiments, and a uniform reduction of the scale
was made to a range from 0 to +1, and from -1 to +1, respectively. A
principal component solution was performed on the two matrices.
Unity was used in the main diagonal for the communality estimations,
and seven factors were extracted and varimax rotated for Experiment I
and six for Experiment II. The number of factors extracted was
decided according to the number of eigenvalues greater than unity.

The factors obtained in Experiments I and II were compared to each
other and interpreted as shown in Table 1. It was found that all the
factors obtained in Experiment I correspond closely to factors from
Experiment II in the way schematically demonstrated in Figure 1.
Four factors from Experiment II appeared as bipolar factors and two
as unipolar factors. There is one bipolar factor in Experiment II
(Discontentedness and Contentedness), which did not appear in Ex-
periment I, and the suggestion of the second half of a bipolar factor
(Passive Repulsion). Not too much emphasis should be placed on the
particular labels used to denote the factors. Nevertheless, it is clear
that all the information gained in Experiment I was available in Experi-
ment II, and moreover, some additional information was obtained.

In order to investigate the manner in which the results from Experi-
ments I and II are related to each other, three hypotheses were sug-
gested.

Hypothesis 1. The negative values in Experiment II were
represented as zero ratings in Experiment I. The implication is that the
subjects were able to discriminate between the zero ratings in Experi-
ment I but had no means available to represent this differentiation.

Hypothesis 2. The negative values in Experiment II were
represented as the lower half of the positive scale in Experiment I. This
implies that the subjects spread out the positive values so that the
lower values were shifted onto the negative side.

Hypothesis 3. The negative values in Experiment II were a reflec-
tion of some of the positive values given in Experiment I. This implies
that if the negative values are pivoted around the zero point of the
scale, then Experiment II will give the same results as Experiment I.

Figure 1. The diagram shows the corresponding factors obtained in Experiments I
and II. The broken lines indicate that no corresponding factor was obtained in Experi-
ment I.

Table 2 shows the percentage distributions for the different scale
values in Experiments I and II, and also the way in which the empirical
scale values were transformed according to the three hypotheses. The
transformed values were then uniformly reduced to range from 0 to 1
and were factor analyzed by the same method as used above. For
Hypothesis 1 seven factors were extracted and rotated to obtain the
best comparison with the factor matrix from Experiment I. This was
done using a computer program (RELATE) by Veldman (1967). This
program rotates a problem factor matrix so that maximum contiguity
with a hypothesis (target) matrix is achieved. Tucker's coefficient of
congruence (see Harman, 1967, p. 270) was then calculated as a
measure of factor similarity and the resulting values are presented in
Table 3. All the coefficients are higher than 0.995, which shows that
these two matrices are almost exactly the same. The distributions of
the estimates shown in Table 2 also indicate that Hypothesis 1 is con-
firmed.

TABLE 2
Actual and Transformed Similarity Scale Values and Percentage Distribution of Empirical Estimates

                                  Experiment II
Experiment I     Actual Values    Hypothesis 1    Hypothesis 2    Hypothesis 3
+4 ( 3.9%)       +4 ( 3.5%)       +4              +4              +4
+3 (10.4%)       +3 ( 8.9%)       +3              +3.5            +3
+2 (13.4%)       +2 (10.9%)       +2              +3              +2
+1 (17.9%)       +1 (13.0%)       +1              +2.5            +1
 0 (54.4%)        0 (37.6%)        0              +2               0
                 -1 ( 4.7%)        0              +1.5            +1
                 -2 ( 6.9%)        0              +1              +2
                 -3 ( 8.2%)        0              +0.5            +3
                 -4 ( 6.3%)        0               0              +4
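
The three transformations listed in Table 2 can be written as simple functions of the nine-point scale value. This is only a sketch of the mappings as tabled; the function names and the `rescale` helper (the "uniform reduction" to a 0 to 1 range, assumed here to be division by 4) are hypothetical.

```python
def hypothesis_1(x):
    """Negative ratings collapse onto zero."""
    return max(x, 0)


def hypothesis_2(x):
    """The whole -4..+4 range is compressed linearly onto 0..+4."""
    return (x + 4) / 2


def hypothesis_3(x):
    """Negative ratings are reflected about the zero point."""
    return abs(x)


def rescale(values, top=4):
    """Uniformly reduce transformed values to the range 0 to 1."""
    return [v / top for v in values]


scale = list(range(4, -5, -1))  # +4, +3, ..., -4, in the order used in Table 2

for transform in (hypothesis_1, hypothesis_2, hypothesis_3):
    transformed = [transform(x) for x in scale]
    print(transform.__name__, transformed, rescale(transformed))
```

Each printed list can be compared column by column with the corresponding hypothesis column of Table 2.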

Three factors were extracted for Hypothesis 2 and five for
Hypothesis 3, the number of factors extracted being decided according
to the eigenvalues, using the same criterion as given above. The two
factor matrices were also rotated against the matrix from Experiment I
in order to obtain maximum contiguity, and the resulting coefficients
of congruence are shown in Table 3. These coefficients are con-
siderably lower than those obtained according to Hypothesis 1. It is
clear that Hypotheses 2 and 3 are inferior in describing the data from
Experiment I, both with respect to the number of factors and to their
loadings.

TABLE 3
Tucker's Coefficient of Congruence

                                         Factor
Comparison                   I       II      III     IV      V
Exp. I/Exp. II (Hyp. 1)     0.997   0.999   0.998   0.998   0.996
Exp. I/Exp. II (Hyp. 2)     0.917   0.887   0.932
Exp. I/Exp. II (Hyp. 3)     0.797   0.609   0.641   0.895   0.958
Exp. I/Ekman 1955           0.996   095     0.998   0.997   0.995

A reanalysis of Ekman's data (1955) was made by the present 
authors using a principal component solution and this analysis sug- 
gested seven factors. This solution was rotated against Experiment I 
and the coefficients of congruence for the comparison are included in 
Table 3. The coefficients show that there is a very strong relationship. 
Earlier reanalyses of Ekman's data have been carried out by Dietze 
(1963) and Waern (1972); however, a comparison with their results is
difficult as they used two kinds of typal analyses. 
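
For readers unfamiliar with the coefficient of congruence used above, it is the sum of cross-products of two columns of factor loadings divided by the square root of the product of their sums of squares. The sketch below uses made-up loadings purely for illustration; the function name `congruence` is hypothetical, and the rotation to a target matrix carried out by RELATE is not reproduced.

```python
import math


def congruence(x, y):
    """Tucker's coefficient of congruence between two columns of loadings."""
    cross = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return cross / norm


# Hypothetical loadings of five variables on one factor in two analyses.
factor_a = [0.80, 0.75, 0.10, 0.05, 0.60]
factor_b = [0.78, 0.70, 0.15, 0.00, 0.65]

print(round(congruence(factor_a, factor_b), 3))
```

Unlike a correlation, the coefficient does not centre the loadings, so it is sensitive to the overall level of the loadings as well as to their pattern.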




Discussion 


The results from this investigation show the relation between an ex-
periment with only positive estimations (Experiment I) and one that
included negative similarity estimations (Experiment II). The negative
similarities were found to have been represented mainly as zero ratings
when only a positive scale was available for the subjects. Two ques-
tions arise naturally from this finding. (1) Do the negative similarity
estimations from this study contain any meaningful information? (2)
Does the relation obtained in this study also hold for other kinds of
stimuli?

There are two reasons why the results from the present experiment
indicate an affirmative answer to the first question. The first reason is
that the negative similarities have yielded bipolar factors which seem
to be psychologically meaningful. The second reason is that the
negative similarities yielded more information in the analysis than was
obtained when only positive estimations were used. The bipolarity
was, of course, suggested by the scaling method used; however, this
was also the case for the unipolar factors obtained when positive
similarity estimations were used in other studies.

It is interesting to note that two factors appeared that were not
bipolar, these being interpreted as Fear and General Agitation. There
are at least two possibilities why these factors have turned out to be
unipolar. The first one is that the words which might appear on the
other half of a bipolar factor were not included among the stimuli
presented to the subjects. The sample of stimuli used in a study like
this one is very important for both the number of factors obtained and
the interpretations that can be made of them. A second possibility is
that the factors represent emotions which are genuinely unipolar. In
the case of Fear it is not easy to find an obvious opposite to the word
regarding the emotional context corresponding to its meaning, for it
does not seem unreasonable to think of Fear as starting from some
kind of “zero” point. In the case of General Agitation it is quite possi-
ble that words like “calm” are experienced as psychologically opposite
to “restless” within the context of the (cognitive) situation;
nevertheless, “calmness” would only seem to represent a lesser degree
of “restlessness” (activity related to a state of balance).

The bipolarity of another of the factors may be questionable. This
factor was interpreted as Attraction and Repulsion (passive) and it has
only one moderate loading on the Repulsion side (0.38 for “detest”
[disgust]). In this case we believe that the lack of words on this side is
more likely to be due to the sample of emotional terms used rather
than the lack of real bipolarity for this factor. Psychologically, the
bipolarity seems meaningful within the interpretations given above.

The additional information obtained by using negative similarities
was one complete bipolar factor and the negative side of another
bipolar factor. The half of the bipolar factor was named Passive
Repulsion and has already been commented on. The complete bipolar
factor was interpreted as Discontentedness and Contentedness, and
the words highly loaded on each side are “vexed” (angry), “rage”
(ireful), and “gay,” “benevolent,” respectively. These words were in-
cluded in other factors in Experiment I, particularly “vexed” (angry)
and “rage” (ireful) which also appeared in Active Repulsion. We are
tempted to describe the qualitative nature of Active Repulsion and
Discontentedness as, in the former case, a feeling directed towards a
being, and in the latter case, a general condition having more to do
with a lack of emotional balance and harmony. The second half of the
latter factor, Contentedness, seems to be less definite as there are no
very high loadings which would make the interpretation unam-
biguous. This may also be due to the particular sample of words used.

The second question mentioned above can only be answered by a
study using other stimuli.

Finally, it should be remembered that both the present study and
Ekman’s (1955) were carried out with Swedish subjects and with Swedish
words, and that the interpretations of the factors were made with
the nuances of the Swedish words in mind.


SCALING ATTITUDE ITEMS: A COMPARISON OF
SCALOGRAM ANALYSIS AND ORDERING THEORY


PETER W. AIRASIAN, GEORGE F. MADAUS, AND ELINOR M. WOODS
Boston College 


The study compared an ordering-theoretic method of identifying 
item hierarchies with scalogram analysis in the evaluation of an 
eight item attitude measure. The attitude measure assessed “pro-
gressive” and “traditional” views of education. Data were collected
in a survey of a random sample of 178 parents of public school chil- 
dren. The scalogram analysis revealed that the items did not form a 
unidimensional and cumulative hierarchy. The ordering-theoretic 
analysis identified a branched, nonlinear hierarchy which had higher 
reproducibility and scalability than the linear hierarchy identified by 
the scalogram analysis. The results support the use of ordering 
theory in defining item hierarchies in attitudinal measures. 


THE purpose of this study was to compare the results of a scalogram
analysis (Guttman, 1944, 1950) of eight items assessing attitude
toward education to the results of an ordering theoretic analysis, a
mode of analysis which is capable of identifying branched hierarchies
among items (Airasian and Bart, 1973). Utilizing statistics such as
reproducibility (Guttman, 1950) and the coefficient of scalability
(Anderson, 1966), the study sought to identify which method of
analysis provided an item hierarchy which best fitted the observed
data.

The classical method of scaling attitudinal items into an ordered
hierarchy is scalogram analysis (Guttman, 1944, 1950). Scalogram
analysis is used to order a group of items into a linear hierarchy and to
gauge whether or not the hierarchy is unidimensional and
cumulative. The degree to which a group of items is judged to possess
these properties is determined by the extent to which “passes” (scores
of 1) on any item co-occur with “passes” on all items ranked lower in
the hierarchy. The inverse is also true. That is, a scale is unidimen-
sional and cumulative insofar as “failures” (scores of 0) on an item co-
occur with “failures” on items ranked higher in the hierarchy.
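
The cumulative property described above can be stated as a simple check on a single response pattern. The sketch below is illustrative only: the function name `is_cumulative` and the sample patterns are hypothetical, and it tests whether one pattern fits a perfect linear hierarchy rather than computing reproducibility or scalability.

```python
def is_cumulative(pattern):
    """True if a 0/1 response pattern fits a perfect linear hierarchy.

    `pattern` lists the scores from the lowest-ranked item to the highest.
    The pattern fits when no item is passed after a lower-ranked item
    has been failed.
    """
    seen_failure = False
    for score in pattern:
        if score == 0:
            seen_failure = True
        elif seen_failure:
            return False  # a pass above a failure breaks the hierarchy
    return True


# Hypothetical response patterns for a four-item scale.
patterns = [
    [1, 1, 0, 0],  # passes the two lowest items only: fits
    [1, 0, 1, 0],  # passes item 3 after failing item 2: does not fit
    [0, 0, 0, 0],  # fails everything: fits
    [1, 1, 1, 1],  # passes everything: fits
]

for p in patterns:
    print(p, is_cumulative(p))
```

With k items in a linear hierarchy only k + 1 of the 2^k possible patterns pass this check, which is one way to see why the constraint is so demanding for longer scales.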

Except in a few rare instances, most notably the articulation of 
social distance scales, scalogram analysis has been used with disap- 
pointing results. A primary reason for the lack of demonstrated 
viability of the approach is the constraint that scaled items must 
manifest a linear hierarchy. A linear hierarchy is one in which one and 
only one item appears at any given hierarchic level and one and only 
one item is immediately prerequisite to or a sufficient condition for any 
given item in the hierarchy. As a consequence of this constraint, 
reliable Guttman scales exceeding five items are rare. Methodological 
extensions of scalogram procedures such as multiple scalogram 
analysis (Lingoes, 1963) and latent structure analysis (Lazarsfeld, 
1959) are similarly constrained to identifying only linear hierarchies 
among items. 

In addition to methodological difficulties in defining reliable linear 
hierarchies consisting of more than five items, there are conceptual 
problems associated with the search for item domains which can ap- 
propriately be represented by linear hierarchies. Recent research in the 
areas of cognitive development (e.g., Airasian and Bart, 1972), tests 
and measurement (e.g., Airasian, 1971a; Cox and Graham, 1966;
Walbesser, 1968), and curriculum development (e.g., Gagne, 1962, 
1967, 1968; Gagne and Paradise, 1961) has demonstrated that groups 
of test items or curricular tasks are more likely to be ordered in non- 
linear, branched hierarchies than in linear hierarchies. While nonlinear 
hierarchies do not manifest the conceptual simplicity of linear 
hierarchies, it is likely that for most sets of test items or curricular 
tasks, nonlinear hierarchies represent a more accurate description of 
the relationships between items or tasks than do linear hierarchies.


Ordering Theory 


Ordering theory (Airasian and Bart, 1973; Bart and Krus, 1973)
is a measurement model which is based upon scalogram analysis but
which extends scalogram techniques to the articulation of nonlinear
item or task hierarchies. Ordering theory is an approach to fundamen-
tal measurement and has as its primary intent either the determination
of a hierarchy for a set of items or tasks or the verification of an a
priori, hypothesized hierarchy among a set of items and tasks. At a
foundational level of explanation, ordering analysis possesses a
Boolean algebraic framework in which item or task response patterns
are viewed as atoms in a Boolean algebra with as many generators as
there are items being considered (Goodstein, 1963). Although there
are many ways to describe ordering theory, it is most meaningfully
classified as a deterministic measurement model which uses item
response patterns as the basic data points to identify both linear and
nonlinear, qualitative, prerequisite relations among test items or tasks.
In this regard, it is a more general case of scalogram analysis.

Ordering-theoretic procedures were used in this study to determine 
prerequisite relations between pairs of items assessing attitude toward 
education. The prerequisite relation is considered here, since that type 
of relation is of primary interest to behavioral scientists, especially in 
their quest to identify sequential or causal relationships among 
phenomena. 

An item i is a prerequisite to an item j to the extent that the (0, 1) 
response pattern, where 0 represents the score on item i and 1 repre- 
sents the score on item /, occurs infrequently. The (0, 1) response pat- 
tern is viewed as a disconfirmation that a correct or acceptable 
response to item i is a prerequisite to a correct or acceptable response 
to item j. For any pair of items, a 2 X 2 table showing the number of 
"passes" or “fails” on the items can be constructed. Table 1 represents 
a hypothetical example of such a table. In Table 1, the (0, 1) response
cell has a frequency of 0, indicating that no subject attained item j after
failing item i. As a consequence, item i can be considered to be prere-
quisite to item j. Note that if the (1, 0) cell also had a frequency of 0, so
that all subjects manifested either a (0, 0) or (1, 1) response pattern,
items i and j would be equivalent. That is, they would not be
hierarchically ordered and would, in fact, be extracting redundant in- 
formation. Constructing contingency tables such as shown in Table 1 
for all possible item pairs is one method of extracting the prerequisite 
relations among a set of items. In this study, a computer program 
(Lele and Bart, 1971) was utilized in the data analyses. 
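
The tabulation described above can be sketched in a few lines of Python. This is only an illustrative sketch and not the Lele and Bart (1971) program; the function names and the six-subject data set are assumptions introduced here for the example.

```python
from collections import Counter

PATTERNS = [(0, 0), (0, 1), (1, 0), (1, 1)]

def pair_table(scores_i, scores_j):
    """Count the four (item i, item j) response patterns for two 0/1 items."""
    counts = Counter(zip(scores_i, scores_j))
    return {pattern: counts.get(pattern, 0) for pattern in PATTERNS}

def is_strict_prerequisite(table):
    """Item i is prerequisite to item j when the (0, 1) pattern never occurs."""
    return table[(0, 1)] == 0

def is_equivalent(table):
    """Items are equivalent when neither (0, 1) nor (1, 0) occurs."""
    return table[(0, 1)] == 0 and table[(1, 0)] == 0

# Hypothetical scores for six subjects (not data from the study).
item_i = [1, 1, 1, 0, 0, 1]
item_j = [1, 0, 1, 0, 0, 1]
table = pair_table(item_i, item_j)
```

In this hypothetical set no subject passes item j after failing item i, so item i would be treated as prerequisite to item j, while the single (1, 0) pattern keeps the two items from being equivalent.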

In defining prerequisite relations between items, ordering theory 
shares one limitation with scalogram analysis. Both measurement
models are deterministic. Thus, neither incorporates a method of deal-
ing with the possibility of encountering random error in item response
patterns. To overcome this limitation, ordering-theoretic analyses rely
upon the use of a preset tolerance level for error. The tolerance level
sets the number of disconfirmatory response patterns which will be ac-
cepted in defining a prerequisite relation between two items. Thus, for


TABLE 1
Example of a 2 × 2 Table to Determine a Prerequisite Relation between Two Items

                       item j
                   0             1
item i    0   [illegible]        0
          1   [illegible]       20
a 5% tolerance level and N subjects, one would tolerate at most (.05)N
disconfirmatory response patterns between items in an item pair 
before accepting a prerequisite relation. Referring to Table 1, for a 5% 
tolerance level and 100 subjects, 5 (0, 1) response patterns could be 
manifested before the prerequisite relation between items i and j was 
rejected. 

There are two general strategies for the implementation of tolerance 
levels in examining the hierarchical structure or ordering among a set 
of items. Within one strategy, a hierarchy and its array of prerequisite 
relations is hypothesized for a set of items or tasks before data collec- 
tion. The response patterns that would disconfirm the entire hierarchy 
are identified and a tolerance level is established (Airasian, 1971b,
1971c). The hypothesized hierarchy is then accepted if the frequency of
obtained disconfirmatory response patterns is less than or equal to the
prescribed tolerance level.

An alternative strategy can be used when no a priori hierarchy 
among items is hypothesized. This strategy, which was used in this 
study, identifies prerequisite relationships between item pairs from the 
obtained response patterns. In this approach, all possible item pairs 
are investigated to identify prerequisite relations. The prerequisite 
relation for a particular pair of items is accepted if the frequency of ob-
tained disconfirmatory response patterns for the item pair is less than
or equal to the frequency of such response patterns established by the
tolerance level. This procedure is followed to test each of the possible
hypothesized prerequisite relations with the same tolerance level being
used for each testing.
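
The pairwise strategy with a fixed tolerance level might be sketched as follows. The function name, the percentage-based parameter, and the five-subject data set are assumptions made for illustration; the comparison is done in integers so that a count lying exactly on the tolerance limit is accepted, as the text specifies.

```python
from itertools import permutations

def prerequisite_pairs(responses, tolerance_percent=5):
    """Return ordered pairs (i, j) where item i is accepted as prerequisite to item j.

    responses: one list of 0/1 item scores per subject, all of equal length.
    tolerance_percent: largest tolerated share of (0, 1) patterns, in percent.
    """
    n_subjects = len(responses)
    n_items = len(responses[0])
    accepted = []
    for i, j in permutations(range(n_items), 2):
        # A disconfirming pattern is a failure on item i with a pass on item j.
        disconfirming = sum(1 for r in responses if r[i] == 0 and r[j] == 1)
        # Integer arithmetic avoids floating-point rounding at the cutoff.
        if disconfirming * 100 <= tolerance_percent * n_subjects:
            accepted.append((i, j))
    return accepted

# Hypothetical response patterns for five subjects on three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
    [1, 1, 1],
]
strict = prerequisite_pairs(responses, tolerance_percent=0)
lenient = prerequisite_pairs(responses, tolerance_percent=20)
```

When both (i, j) and (j, i) are accepted, the two items would be regarded as equivalent at that tolerance level; in the hypothetical data this happens for items 0 and 1 once the tolerance is raised to 20%.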

Several discussions of ordering theory have been provided. Airasian 
and Bart (1973) articulated the general nature of ordering theory. Bart 
and Krus (1973) described an ordering-theoretic technique by which 
item or task hierarchies could be determined. Krus and Bart (1974) 
discussed an ordering-theoretic technique to scale items in a mul-
tidimensional setting. Airasian, Bart and Greaney (1973) and Bart and
Airasian (1972) investigated the hierarchies among Piagetian tasks by 
means of ordering theory. Two defining properties of ordering theory 
cited in these discussions are that all test items or tasks to be examined 
must be dichotomously scored and that all subjects in a sample must 
respond to all of the items or tasks. In this study an ordering-theoretic
analysis was applied to an attitudinal scale assessing attitude towards
education. 


Procedures 


The scale which was studied consisted of eight statements
measuring a “progressive” and “traditional” view of education
(Kerlinger, 1967). The eight items were part of an interview instrument
administered to a random sample of 183 parents of public school 
children in a suburb of Boston. The items comprising the scale are 
listed in Table 2 along with the responses expected for a subject 
categorized as a “progressive.” 

Each of the eight statements permitted five response options: agree 
strongly, agree somewhat, disagree somewhat, disagree strongly, and 
no opinion. To meet the constraint of ordering theory that item 
responses be bivalued, the two "agree" response options were col- 
lapsed into one category and the two "disagree" response options into 
a second category. A total of 5 of the original 183 parents selected the 
"no opinion" option in responding to one or more of the eight items. 
These respondents were eliminated from the sample reducing the final 
sample size in the study to 178. 
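
The recoding step can be illustrated with a short Python sketch. The mapping follows the description above; the function name and the three example respondents are hypothetical and are not data from the study.

```python
# Collapse the four substantive options into two categories.
COLLAPSE = {
    "agree strongly": "agree",
    "agree somewhat": "agree",
    "disagree somewhat": "disagree",
    "disagree strongly": "disagree",
}

def dichotomize(respondents):
    """Drop respondents with any 'no opinion' answer and collapse the rest."""
    kept = []
    for answers in respondents:
        if "no opinion" in answers:
            continue  # respondent is eliminated from the sample
        kept.append([COLLAPSE[answer] for answer in answers])
    return kept

# Hypothetical answers of three respondents to two items.
raw = [
    ["agree strongly", "disagree somewhat"],
    ["no opinion", "agree somewhat"],
    ["disagree strongly", "agree somewhat"],
]
recoded = dichotomize(raw)
```

Applied to the full sample, a step of this kind would remove the 5 respondents who chose "no opinion" at least once and leave 178 bivalued response patterns.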

For the purposes of analysis, the eight statements were scored in the 
direction of the “progressive” view of education, with the “progres-
sive” response to each item being coded 1 (pass) and the “traditional”
response 0 (failure). Thus, if an individual disagreed with item 1 in
Table 2, he scored 1 on that item, since his response matched the
“progressive” key. The selection of the “progressive” scoring key was
arbitrary. The need was to establish, for each respondent, an item
response pattern comprised of 0’s and 1’s in the same metric as the
other respondents’ patterns. Had “traditional” responses been


TABLE 2
“Progressive” and “Traditional” Items and “Progressive” Scoring Key

Item                                                   Progressive
                                                       Response
1. Teachers should keep in mind that pupils
   have to be made to work.                                 D
2. More authority is needed in today's classroom.           D
3. Student participation in the forming of school
   policy is a privilege granted by the school
   rather than a matter of student rights.                  D
4. Teachers should encourage pupils to study and
   criticize our own and other economic systems
   and practices.                                           A
5. Children need to have more supervision and
   discipline than they are getting.                        D
6. There should be more emphasis on the three R's.          D
7. Schools should be sources of new social ideas.           A
8. Modern schools have too many fads and frills,
   such as activity programs, driving education,
   crafts, social services and the like.                    D

selected as the scoring key, the hierarchy produced by the analysis
would have been identical to the “progressive” results, except in in- 
verted order. The coefficient alpha reliability of the dichotomized 
items for the sample of 178 respondents was .71. 


Results 


Figure 1 shows the linear hierarchy generated by the scalogram
analysis. The numbers in the scale correspond to item numbers in
Table 2. Item 4, which stated “Teachers should encourage pupils to
study and criticize our own and other economic systems and prac-
tices,” emerged as the most progressive item. The scale ranged upwards
to item 3, “Student participation in the forming of school policy is a
privilege granted by the school rather than a matter of student rights,”
which was the least progressive item in the scale.

The coefficient of reproducibility (Guttman, 1950; Torgerson,
1958), which provides an index of the fit of the data to the linear 
hierarchy resulting from the scalogram analysis, was .84. For an eight 
item hierarchy, reproducibility should be a minimum of .9 if the 
hierarchy is to be considered unidimensional and cumulative. 

Figure 1. Scalogram Analysis Item Hierarchy. [Scale runs from Progressive
at the bottom to Traditional at the top; item 3 is at the top.]

Three other statistics, the minimum marginal reproducibility, the
percentage improvement, and the coefficient of scalability, were
calculated for the hierarchy identified by the scalogram analysis
(Anderson, 1966). The reproducibility of any item can never be less
than the percentage of respondents falling into the most popular
response option for that item. The minimum marginal reproducibility 
is the minimum coefficient of reproducibility that could have occurred 
for the scale given the percentage of respondents falling into the most 
popular response option for each item. It is the average across all items 
of the percentage of respondents selecting the most popular response 
option on each item. For the scalogram analysis, the minimal marginal 
reproducibility was .72. The percentage improvement, which indicates 
the extent to which the coefficient of reproducibility is due to the re- 
sponse patterns and not the percentage of respondents selecting the 
most popular options on the items, is the difference between the coeffi-
cient of reproducibility and the minimal marginal reproducibility.
In the case of the scalogram analysis, the percentage improvement
was .12, a small value. Finally, the coefficient of scalability, which
is obtained by dividing the percentage improvement by 1 minus the
minimum marginal reproducibility, was .43 for the scalogram analysis.
In order for a hierarchy to be considered truly unidimensional and
cumulative, the coefficient should be well above .5 (Anderson, 1966).
Overall, then, the scalogram analysis revealed that a linear hierarchy
is not an adequate representation of the relationships among the eight
items.
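
The arithmetic linking these three statistics can be checked with a brief Python sketch. The reproducibility (.84) and minimum marginal reproducibility (.72) are the values reported above; the function name and the list of modal response proportions are hypothetical, chosen only so that their average is .72.

```python
def scalability_statistics(reproducibility, modal_proportions):
    """Return (minimum marginal reproducibility, percentage improvement,
    coefficient of scalability) as defined in the text."""
    # Average, across items, of the share choosing the most popular option.
    mmr = sum(modal_proportions) / len(modal_proportions)
    improvement = reproducibility - mmr
    scalability = improvement / (1 - mmr)
    return mmr, improvement, scalability

# Hypothetical modal proportions for eight items; their mean is .72.
modal = [0.50, 0.60, 0.70, 0.80, 0.90, 0.70, 0.80, 0.76]
mmr, improvement, scalability = scalability_statistics(0.84, modal)
```

Rounded to two decimals, these inputs give a percentage improvement of .12 and a coefficient of scalability of .43, which agrees with the figures reported for the scalogram analysis.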

Figure 2 shows the hierarchy generated by the ordering-theoretic 
analysis with a tolerance level set at 10%. Two points about this order- 
ing, relative to the scalogram analysis, are noteworthy. First, the items 
occupy the same relative positions in both hierarchies, although there
are fewer hierarchical levels in the hierarchy generated by the ordering
analysis. Thus, while items 1, 2, 3, and 5 are all at the same level in the
ordering analysis results, they are at a higher level in the hierarchy
than the remaining items. Similarly, item 6 scales above items 8 and 4
in both hierarchies, as does item 8 above item 4. Second, the ordering
analysis reveals where the hierarchy for the eight items departs from
linearity. Scalogram analysis reveals the best fitting linear hierarchy
and the extent to which that hierarchy is unidimensional and
cumulatively hierarchical. It cannot reveal departures from linearity.
The ordering-theoretic analysis indicated that the best fit for the item
response patterns was a non-linear, branched hierarchy. The analysis
generated the form of that hierarchy.

Figure 2. Item Hierarchy from Ordering Analysis.

Figure 2 indicates prerequisite relations as well as relations of
equivalence and logical independence between the eight items. Agree-
ment on one statement is a prerequisite to agreement on another state-
ment if the number representing the first statement is connected to the
number representing the second statement by a line that passes in a
general upwards direction. For the relation “is a prerequisite to,”
phrases such as “is a precondition for” and “is a necessary condition
for” may be viewed as being synonymous. Also, if success on one task
is a necessary condition for success on another task, success on the sec-
ond task is a sufficient condition for success on the first task. Table 3
lists the logical relations among the eight items identified by the order-
ing analysis.

While most “progressives” would agree with item 4, the ordering
analysis reveals that a split occurs given agreement with item 4. Given
that the keyed responses to items 7 and 8 are “agree” and “disagree,”
respectively, some “progressive” respondents indicated disagreement
with item 8, “Modern schools have too many fads and frills . . .” and
disagreement with item 7, “Schools should be the source of new social
ideas,” while others agreed with 8 and 7. The responses of the respondents
indicate that the items concerned with “fads and frills” in the schools
(item 8) and the school teaching new social ideas (item 7) are not on a
unidimensional continuum. The ordering analysis reveals that the is-
sue of schools being the source of new social ideas (item 7) is different
in kind than the issue of “fads and frills” in school programs (item 8).


TABLE 3
Logical Relations among the Eight Items as Defined by the Ordering Analysis


Item Relation Item(s) 


is prerequisite to 1 
is prerequisite to 1 
is prerequisite to 1 
is prerequisite to I 
is equivalent to 5 
is independent of. 7 
is independent of 7 
is independent of. 2 
is independent of 1 
is independent of 1 
is independent of 1 
Note that item 6, “There should be more emphasis on the three R's,” is
related to item 8 but not to item 7, further corroboration that the 
respondents perceive the issue of new social ideas to be distinct from 
concern over the more formal academic program. The ordering 
analysis also revealed that items 2, 5, 1, and 3 were the least “progres-
sive” items. However, these items were not hierarchically ordered
among themselves. Finally, the double arrow between item 2, “More
authority is needed in today's classroom” and item 5, “Children need
to have more supervision and discipline than they are getting,” in-
dicates that these items are equivalent. That is, the parents responded
identically to items 2 and 5. If a respondent agreed with 2, he agreed
with 5; if he disagreed with 2, he disagreed with 5. In essence, the
ordering analysis revealed that these two items were extracting redun- 
dant information and that one, but not both, was needed in the scale. 

Table 4 compares the scalogram analysis with the ordering-theoretic 
analysis in terms of the four statistics used to validate a hierarchy. It is 
evident that the branched hierarchy identified by the ordering analysis 
is more reproducible than the linear scalogram analysis hierarchy. The 
nonlinear hierarchy evidenced a reproducibility well above the lower 
limit of scalability discussed by Torgerson (1958). Since the same item 
response patterns were analyzed in both analyses, the minimum 
marginal reproducibilities are identical. Given the higher repro- 
ducibility for the ordering-theoretic results, the percentage improve- 
ment figure is higher for the nonlinear hierarchy than for the linear 
hierarchy. Finally, the hierarchy generated by the ordering analysis
evidenced a coefficient of scalability considerably larger than the 
hierarchy from the scalogram analysis. 


TABLE 4
Comparison of Scalogram and Ordering-Theoretic
Results on Hierarchy Validation Statistics

Statistic                             Scalogram     Ordering
                                      Analysis      Analysis
Reproducibility                         .84        [illegible]
Minimal Marginal Reproducibility        .72           .72
Percentage Improvement                  .12        [illegible]
Coefficient of Scalability              .43        [illegible]


Conclusions


The results of the study support the use of ordering-theoretic
analysis in the evaluation of attitude scales. While nonlinear item
hierarchies may not always match the conceptual simplicity of Gutt-
man scales, it is likely that nonlinear hierarchies provide a more ac-
curate representation of the inter-item relationships existing in
various attitudinal domains. The depiction of a branch hierarchy of 
specific logical relationships among attitudinal items, rather than the 
simple linear sequence afforded by scalogram analysis, may provide 
the researcher with more insight into the dynamics underlying the at- 
titude. 

Whenever logical relationships between test items or tasks are of in- 
terest, ordering theory can be used. Ordering theory can reveal non- 
linear lines of implication among items or tasks and in so doing, serve 
as a basis for hypothesizing lines of causation to be tested in ex- 
perimental settings. 


THE DIFFERENTIAL FORMATION OF RESPONSE
SETS BY SPECIFIC DETERMINERS

This investigation was conducted to determine if both position and
alternative length specific determiners cause a differential formation
of response sets on tests in high and low scoring groups. A 46-item
vocabulary test with 10 alternate forms varied by type and frequency
of specific determiners was administered to 1000 undergraduate
psychology students. Results indicated that as the frequency of
specific determiners increased, they formed increasingly strong but 
differential guessing response sets in high and low scoring groups; 
however, the magnitude of the effect was much stronger for position 
specific determiners. Results were interpreted in terms of differing 
frequencies of appearance in existing tests and the actual nature of 
the responses an examinee makes to multiple-choice items. 


In the mid-1920's, educators perceived the value of using objective
examinations as a method for testing the achievement or aptitude of
individuals and published a variety of articles and books which discuss
specific recommendations for objective item and test construction
(Weideman, 1926; Orleans and Sealy, 1928; Odell, 1928). Included in
the ensuing discussion of the technical considerations that must be
dealt with in the writing of objective test items was more caution to
avoid providing clues or “specific determiners” which raised above the
chance level the probability of success on an item about which the ex- 
aminee has no knowledge (Lang, 1930; Hawks, Linquist, and Mann, 
1936; Remmers and Gage, 1955; Travers, 1955; Garrett, 1964; and 
Stanley, 1964). 


During the late 1920's, other investigators began to examine the ex-
istence of response tendencies associated with examinees rather than
specific item construction. These response tendencies, now called
response sets, seem to cause “a person consistently to give different
responses to items than he would when the same content is presented 
in a different form" (Cronbach, 1946). 

During the next 30 years, response sets and specific determiners 
were treated as rather independent entities. Moreover, test writers 
seemed to agree with Cronbach (1946) that multiple-choice objective 
examinations were free from response sets. However, in 1962, Wevrick
performed the first empirical investigation of the possibility of an in-
teraction between the two entities.

Wevrick's study was designed “to determine if conditions could be 
contrived such that positional response biases would be demonstrated 
in a highly structured multiple-choice test." Wevrick’s hypotheses 
were: (a) for each item, if the correct alternative occupies the biased 
position, then the proportion of correct responses made to that item
will increase; and, (b) if the position of the correct alternative is ran-
domly distributed across items, a (position) response set will not in-
fluence the total score distribution.

Wevrick successfully established a response set for the keyed posi-
tion by utilizing a test whose arrangement led to a single position being
the keyed answer nearly all of the time for the first (easiest) test items
and thereafter decreasing in frequency of appearance. However, no
such response set was established for tests involving randomly keyed
alternatives.

A two-part study reported by Chase (1964) demonstrated the ex-
istence of a sui-generis response set to choose markedly longer alter-
natives in a multiple-choice test on a topic about which a sample of
college students had no knowledge. A second part of the study further
demonstrated that with a carefully constructed test, this preexisting
response set could be removed.

To summarize, Wevrick’s and Chase’s studies seem to imply that
there is a causal relationship between specific determiners and the for-
mation of response sets. If this relationship does in fact exist, then any
test that contains specific determiners may have its reliability
spuriously inflated by correlated, but irrelevant, consistencies and its
validity lowered since those co | 
those abilities the test is t; 
the formation of. th 
The purpose of this study, therefore, is to determine if in fact there is
a differential formation of response sets in groups having high and low 
test scores through the action of specific determiners and to determine 
if this formation of response sets occurs with two types of specific 
determiners. 


Methods 


Construction— Position Specific Determiner 


The general construction of all forms examining the differential for-
mation of response sets in high and low performance groups by posi-
tion specific determiners was similar. Each test consisted of 46 four
alternative multiple-choice vocabulary items in which the examinee 
was to choose the alternative which was most closely synonymous to 
the stem word. Nine of the items on each test were "pure guess" items 
in which the stem word was selected from a list of words given in the 
Teacher's Word Book of 30,000 Words (Thorndike and Lorge, 1944) as 
appearing four times in 18,000,000 occurrences and none of the alter- 
natives were synonyms or antonyms of the stem word. 
The items in each test were arranged so that each of the first four
items was keyed in a different position and the next two items were 
“pure guess” or probe items. This arrangement allowed the research to 
determine, by examining the probe items, whether or not an initial 
position response set was in existence. The remaining 40 items, in-
cluding seven probe (or “pure guess”) items spaced evenly
throughout, were keyed according to the following schedule: 


Form 0—twenty-five percent of the items were keyed in each posi-
tion

Form 4—forty percent of the items were keyed in position two, and
the rest of the keyed responses were equally distributed
among the other three alternatives

Form 5—fifty percent of the items were keyed in position two, and
the rest of the keyed responses were equally distributed
among the other three alternatives

Form 6—sixty percent of the items were keyed in position two, and
the rest of the keyed responses were equally distributed
among the other three alternatives

Form 7—seventy percent of the items were keyed in position two,
and the rest of the keyed responses were equally dis-
tributed among the other three alternatives


All items with a keyed response were chosen from a list of one 
hundred items provided by Ruff (1967) who developed response dis-
crimination indices for these items in a sample of junior and senior
level high school students. Stem words and alternatives with the 
highest response discrimination indices (alternative-total score cor- 
relations ranging from .40 to .78 for the correct alternatives and from 
−.11 to −.67 for the incorrect alternatives) were selected to be included
in this study. 

Each form of the test consisted of the same items, with alternatives 
arranged in an order that would provide the desired percentage occur- 
rence of the position specific determiner. 


Administration— Position Specific Determiner 


One hundred copies of each form were administered without time 
limit to separate sections of an Introductory Psychology class at the 
University of Tennessee. There were no oral instructions. Instructors 
in each of these sections had asked one hundred students to participate 
in a graduate research study. Written instructions on the first page of 
each form were as follows: 


The following test attempts to measure intelligence by responses 
indicating knowledge of the meaning of abstract words. There are 
a few easy items which nearly everyone will get right and a few
very hard items which almost no one will answer correctly. For
our research purposes, we need to have an answer for every item
on the test, so no matter how unfamiliar a word may be, please
record the best response you can, even if it is a pure guess. It is
critical to our research that you work straight through the test and 
not skip around in answering the questions. Please place all your 
answers on the answer sheet. Do not write on the test itself. ON 
YOUR ANSWER SHEET PLACE THE NUMBER OF THE 
WORD WHICH HAS THE MEANING CLOSEST TO THE 
FIRST (CAPITALIZED) WORD FOR EACH ITEM. 


Scoring—Position Specific Determiner 


The scoring on all forms was as follows: 

1. The number of correct, keyed items was determined for each of
the 100 subjects.

2. The top 25 scores and bottom 25 scores were selected as consti-
tuting the High and Low Groups, respectively. In the case of ties
for the 25th score in each group, a random selection was made
among all those tied scores.

3. The alternative selected for each of the “pure guess” items was
tabulated separately for all members in each of the two groups.
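
The group-selection step in the scoring procedure above can be sketched in Python as follows. This is one possible reading of the random tie-breaking rule, not the authors' procedure: the function name, the seed parameter, and the generated scores are assumptions made for the example.

```python
import random

def split_high_low(scores, group_size=25, seed=None):
    """Return (high_ids, low_ids) from a mapping of subject id to score.

    Subjects are shuffled before a stable sort, so subjects tied at the
    cutoff score enter a group in random order.
    """
    rng = random.Random(seed)
    ids = list(scores)
    rng.shuffle(ids)
    # sorted() is stable, so the shuffled order survives among equal scores.
    ranked = sorted(ids, key=lambda subject: scores[subject], reverse=True)
    return ranked[:group_size], ranked[-group_size:]

# Hypothetical scores for 100 subjects, with many ties.
scores = {subject: subject % 40 for subject in range(100)}
high, low = split_high_low(scores, group_size=25, seed=1)
```

Shuffling before the sort is a design choice: it keeps the selection among tied subjects random without special-casing the 25th score.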


Construction—Longest Alternative Specific Determiner


The general construction of all tests examining the differential for-
mation of response sets in High and Low performance groups by alter-
native length specific determiners was similar to that of the forms
previously discussed. 

Each test consisted of 46 four alternative multiple-choice 
vocabulary items in which the examinee was to choose the alternative 
which provided the best definition of the stem word. Each of these 
items was formed by simply providing the definitions of the synonyms 
used on the forms for testing position specific determiners. 

Each form of the test consisted of the same items, with the longest alter-
native being the keyed response according to the following schedule:

Form 00—the longest alternative was keyed 25% of the time. 

Form 8—the longest alternative was keyed 40% of the time. 

Form 9—the longest alternative was keyed 50% of the time. 

Form 10—the longest alternative was keyed 60% of the time.

Form 11—the longest alternative was keyed 70% of the time. 

Care was taken to ensure that any position was keyed only on 
twenty-five percent of the items and that each “longest” alternative 
was at least twice as long as any other alternative (based on actual 
word count). 


Administration—Longest Alternative Specific Determiner 


Procedures for administering these forms were identical to those for 
administering forms examining position specific determiners with the 
exception that the final sentence of instructions was changed to read, 
“ON YOUR ANSWER SHEET PLACE THE NUMBER OF THE 
DEFINITION WHICH MOST CLOSELY DEFINES THE 
CAPITALIZED WORD FOR EACH ITEM.” 


Scoring—Longest Alternative Specific Determiner 


Procedures for selecting the 25 members of the High Group and 
Low Group were identical to those for the position specific determiner 
test forms. After these groups were formed, a tabulation was made of
whether or not the longest alternative was selected for each “pure
guess” item for all individuals in both groups. 


Results 


Data (Forms 0, 4, 5, 6, and 7) examining the prior existence of a
response set to guess position number two were analyzed using a t-test
for the mean number of times position two was selected on the first
two probe items against the mean value expected by chance. The com- 
puted t-statistic was 0.50, a value not significantly different from 
chance. 

Data examining the formation of response sets by position specific 
determiners were initially analyzed using a one-factor analysis of 
variance for probe items three through nine, in which the main effect 
was the percentage occurrence of items keyed in position number two 
on each form of the test. Separate analyses were conducted for groups 
designated as "High" and “Low” based on their number of correct 
responses on non-probe items. Results of the separate analyses 
demonstrated a significant main effect in both the High Group (F = 
5.62, p < .001) and Low Group (F = 2.53, p < .05).

Further analyses were completed by computing t-statistics for all
forms within each group, testing the difference between the mean 
number of times position two was selected by individuals within each 
group against the mean number of times that alternative would be ex- 
pected to be selected by chance. 
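
A comparison of this kind can be sketched in Python. The chance value is taken here as 7 probe items times a probability of .25 per four-alternative item, which reproduces the theoretical mean of 1.75; this derivation is an inference, and the use of a one-sample t with the sample standard deviation is likewise an assumption rather than a documented detail of the authors' computation. The counts are hypothetical.

```python
import math
import statistics

N_PROBES = 7      # probe items three through nine
P_CHANCE = 1 / 4  # four alternatives per item
chance_mean = N_PROBES * P_CHANCE

def one_sample_t(counts, expected_mean):
    """t statistic for the mean of counts against a fixed expected mean."""
    n = len(counts)
    mean = statistics.mean(counts)
    sd = statistics.stdev(counts)  # sample standard deviation (n - 1)
    return (mean - expected_mean) / (sd / math.sqrt(n))

# Hypothetical number of times each of eight subjects chose position two.
counts = [2, 3, 1, 4, 2, 3, 2, 3]
t_value = one_sample_t(counts, chance_mean)
```

A positive value indicates that position two was chosen more often than the chance expectation; its significance would be read from a t table with n - 1 degrees of freedom.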

As demonstrated by Table 1, all means were ordered in the expected 
direction in both groups. 

A sign test computed on this ordering was significant at p = .031 for 
each group. Also, neither of the two control groups (Form 0, High 
Group and Low Group) selected position number two significantly 
more often than expected by chance; whereas, with one exception, 
position two was selected significantly more often than chance would 
predict in all other groups. An illustration of these data is provided in
Figure 1.


TABLE 1
t-Tests for Position Two against Theoretical Mean for High and Low Groups

                 Observed    Theoretical
                   Mean         Mean          t         p
High Groups
  Form 0           1.76         1.75         .01       NS
  Form 4           2.64         1.75        2.81      .005
  Form 5           2.64         1.75        3.02      .005
  Form 6           3.04         1.75        6.34      .001
  Form 7           3.48         1.75        8.04      .001
Low Groups
  Form 0           1.60         1.75         .71       NS
  Form 4           2.16         1.75        2.28      .05
  Form 5           2.20         1.75        1.62       NS
  Form 6           2.44         1.75        2.90      .05
  Form 7           2.64         1.75        3.09      .005


Figure 1. Mean frequencies for selection of position two in High and Low groups.
[Horizontal axis: forms; separate lines for the High Group and the Low Group.]


Additional t-statistics were computed that tested the differences
between the mean number of times position number two was selected
by the High and Low Groups on each form. As shown in Table 2,
significant differences existed between these groups on Forms 6 and 7.


TABLE 2
t-Tests for Position Mean Differences between High and Low Groups

Form        t        p
0          .53      NS
4         1.26      NS
5         1.10      NS
6         1.71     .05
7         2.21     .025


Data (Forms 00, 8, 9, 10, and 11) examining the prior existence of a
response set to guess the longest alternative were analyzed using a t-
test comparing the mean number of times that alternative was selected
on the first two probe items to the mean number of times it would be
expected to be selected by chance. The computed t-statistic was 0.50, a
value not significantly different from chance.

Data examining the formation of response sets by overkeying the
longest alternative were analyzed using a one-factor analysis of
variance for probe items three through nine in the same manner as that
for the formation of response sets by position specific determiners.
Results of these analyses indicate a significant main effect between
forms for the High Groups (F = 3.31, p < .05), while there was no
significant difference between forms for the Low Groups (F = 0.45, p

Computations of t-statistics testing the difference between the mean 
number of times the longest alternative was selected by individuals 
within each group against the number of times that alternative would be
expected to be selected by chance are presented in Table 3. 

As shown by this table, only in one of the ten groups is the longest 
alternative selected significantly more often than expected by chance. 
However, all means within the High Group are ordered in the 
predicted direction. Illustration of these data is provided in Figure 2. 


TABLE 3
t-Tests for Longest Alternative against Theoretical Mean for High and Low Groups

                 Observed    Theoretical
                   Mean         Mean          t         p
High Groups
  Form 00          1.44         1.75       -2.47      .05
  Form 8           1.64         1.75         .53       NS
  Form 9           1.38         1.75         .49       NS
  Form 10          2.08         1.75        1.13       NS
  Form 11          2.64         1.75        2.65   [illegible]
Low Groups
  Form 00          1.43         1.75       -1.32       NS
  Form 8           1.68         1.75        -.38       NS
  Form 9           1.52         1.75       -1.14       NS
  Form 10          1.80         1.75         .24       NS
  Form 11          1.76         1.75         .05       NS


Figure 2. Mean frequencies for selection of longest alternative in High and Low groups.
[Vertical axis: mean frequency.]


Since the longest alternative was selected significantly more often
than dictated by chance in Form 11 only, an additional t-statistic
testing the difference between the mean number of times the longest
alternative was selected in the High and Low Groups was computed
for that form. The computed t statistic was 2.61, significant
beyond the .01 level.
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Discussion 


The results of the analysis of data for both position and alternative 
length specific determiners have led the authors to four conclusions: 


1. Specific determiners can cause the formation of response sets. 

2. The strength of a response set for a given test is positively related 
to the number of correct answers an individual makes on that 
test. Thus, differential response set formation occurs. 

3. The strength of a response set varies with the extent to which the 
specific determiner occurs within a test. 

4. The strength of a response set appears to vary according to the 
type of specific determiner, e.g., the position response set was 
more easily formed than the alternative length response set. 


Why have these phenomena occurred? Wevrick's (1962) discussion 
implies that the formation of response sets by specific determiners is an 
instrumental conditioning phenomenon with partial reinforcement 
operating to sustain the response set. The reasoning behind this in- 
strumental conditioning approach seems to be that the test item alter- 
natives act as a stimulus condition, marking an alternative serves as 
the operant response, and knowing that one has marked the correct 
alternative is the reinforcement. Through the operation of specific deter- 
miners, the response of marking a specific alternative is reinforced by 
knowledge that the alternative is correct, and thus the occurrence of 
marking that alternative is increased. However, the major problem 
with this interpretation is that, in the test situation, the reinforcement 
precedes the conditioned response rather than following it. That is, the 
examinee's reason for marking a particular response is that he already 
believes it is the correct one. Therefore, the situation is in clear viola- 
tion of the contingency principle of instrumental conditioning. 

The investigators feel that a classical conditioning analysis provides 
a clearer explanation of why the phenomenon occurs. In this analysis, 
the conditioned stimulus, position two, occurs prior to the onset of the 
unconditioned stimulus, recognizing the correct alternative. The 
pairing of position two with recognition of the correct response for a 
number of trials appears to cause that position to elicit the conditioned 
response of marking it, given the absence of a conflicting stronger 
response (recognizing the correct alternative in some position other 
than two). 

This classical conditioning analysis would seem to provide an 
explanatory scheme for (a) the formation of response sets; (b) the 
positive relation between the strength of response sets and the number 
of correct answers (the higher scorers have more conditioning trials 
and fewer extinction trials); and (c) the positive relationship between 
the strength of the response set and the extent to which the specific 
determiner occurs (more learning trials). 

However, the analysis does not explain why the position specific 
determiners established a stronger response set than the alternative 
length specific determiner. 

There seem to be two plausible reasons why position specific 
determiners form strong guessing response sets while response length 
specific determiners do not. The first is that since these data have in 
fact demonstrated that the formation of guessing response sets by 
specific determiners is a function of the amount of exposure to that 
specific determiner, and since analyses by Jones (1972) demonstrated 
that position specific determiners occur in both published aptitude and 
achievement tests and teacher-made tests far more frequently than do 
response length specific determiners, one may logically conclude that 
examinees have been more exposed to this type of specific determiner 
and are more prone to perceive it and make their guesses accordingly. 

A second explanation for these results is based on the difference 
between the alternative selected acting as a stimulus to or a response of 
the individual examinee. Position specific determiners act not only as 
stimuli for the subject's response, but more importantly, act as a 
response themselves. That is, examinees actually make the response of 
marking position two, for example, on their answer sheets. On the 
other hand, the longest alternatives function only as stimuli for a 
response. That is, even though the examinee selects the longest 
alternative as the correct answer, his response is still the marking of the 
position that alternative occupies on his answer sheet. Thus, it might 
very well be speculated that the key element in the formation of 
response sets by specific determiners is the response the subject makes, 
not the stimulus for that response. 

This hypothesis can be tested in future research by simply having the 
subject write down on the answer sheet the alternative he has chosen. 
Thus, when probing for the formation of guessing response sets by 
response length specific determiners, if the subject writes down the 
longest alternative, it serves not only as a stimulus, but as his response. 
If the hypothesis holds true, one would predict the strong differential 
formation of guessing response sets to occur as a result of the longest 
alternative specific determiner just as for the position specific 
determiner. 

The practical significance of the findings of this study depends upon 
the extent to which specific determiners occur in actual tests, both 
teacher-made and professionally constructed published examinations. 

Two studies cast some light on this matter. Metfessel and Sax (1958) 
examined 30 multiple-choice and true-false standardized published 
achievement and aptitude tests and tabulated the position of the keyed 
responses. They obtained Chi-squares significant at the .01 level of 
probability or better for 42% of the tests they examined and concluded 
that there seemed to be a general trend in multiple-choice tests for the 
keyed answer to be in the middle of the distribution, and that in the 
true-false tests, test makers tended to write more true questions than false. 

More recently, Jones (1972) examined 105 teacher-made college 
course tests and 49 published aptitude and achievement tests for the 
existence of position or alternative length specific determiners. A Chi- 
square analysis applied to the data gathered in this study 
demonstrated that 37 of the teacher-made tests and five of the 
published tests rejected the hypothesis that "no position is keyed more 
often than dictated by chance" at least at the .05 level of probability. It 
was stated, however, that it was suspected that for some of the 
published tests, only the relatively small number of items in a sub-test 
prevented the Chi-square analysis from being significant. Support for 
the supposition was provided by the fact that ten of the sub-tests had 
one alternative that was never keyed and yet had a computed Chi- 
square that did not differ significantly from chance. 
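
The point about short sub-tests can be illustrated with a goodness-of-fit Chi-square on keyed positions. This is a minimal sketch, not the computation reported by Jones (1972): the key counts are hypothetical, four alternatives per item are assumed, and the critical value 7.815 is the standard tabled value for 3 degrees of freedom at the .05 level.

```python
def key_position_chi_square(key_counts):
    """Goodness-of-fit chi-square of keyed-position counts against equal use."""
    n_items = sum(key_counts)
    expected = n_items / len(key_counts)
    return sum((observed - expected) ** 2 / expected for observed in key_counts)

# Tabled critical value of chi-square, 3 degrees of freedom, .05 level.
CRITICAL_05_DF3 = 7.815

# Hypothetical 12-item sub-test in which the fourth position is never keyed.
short_subtest = [4, 4, 4, 0]
# Hypothetical 48-item test with exactly the same keying proportions.
long_test = [16, 16, 16, 0]

for label, counts in (("12 items", short_subtest), ("48 items", long_test)):
    chi_square = key_position_chi_square(counts)
    verdict = "significant" if chi_square > CRITICAL_05_DF3 else "not significant"
    print(f"{label}: chi-square = {chi_square:.2f} ({verdict} at .05)")
```

With identical proportions, the statistic grows in direct proportion to the number of items, so a position that is never keyed can go undetected in a short sub-test.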

This study further demonstrated that six of the teacher-made tests 
and none of the published tests seemed to involve the significant 
overkeying of a “clearly longer" alternative. 

Thus, it seems likely that specific determiners do exist in real testing 
situations and they often exist in sufficient concentration to cause the 
formation of a guessing response set. 

The practical significance of this study is apparent. The failure to 
eliminate such specific determiners will result in spurious measurement 
of reliability and validity due to the inclusion of a measured correlated 
error and will in addition create a bias in favor of more knowledgeable 
students. 


STRUCTURE OF ACADEMIC ATTITUDES AND 
STUDY HABITS 


S. B. KHAN 
The Ontario Institute for Studies in Education 


DENNIS M. ROBERTS 
The Pennsylvania State University 


SSHA was administered to a total of 846 students. Their responses 
were analyzed to test the a priori classification of items into DA, 
WM, TA, and EA scales and to test the hierarchical structure of the 
scales. Transformations of the initial factor matrix to varimax and 
congruence to a hypothesized structure supported the classification 
of items into DA, WM, and TA scales but not the EA scale. The 
higher-order factoring of the initial factor-intercorrelations revealed a 
two-stage hierarchical structure. Implications for students' 
counselling are discussed. 


Copyright © 1975 by Frederic Kuder 

THE evaluation of students' attitudes and study habits has, in part, 
been prompted by the inability of aptitude variables to account for a 
major portion of variance in school learning and achievement. Brown 
and Holtzman (1953) developed the Survey of Study Habits and 
Attitudes (SSHA) for measuring relevant attitudes and study habits and 
recommended its use for diagnosis, counselling, and prediction. One 
of the major criticisms of the SSHA was that it yielded a single score 
which did not contain much information about the strengths and 
weaknesses of an individual in specific areas. To answer this 
criticism, Brown and Holtzman (1965) revised the instrument and 
classified the items on the basis of judgments by 15 experts into four a 
priori scales; namely, Delay Avoidance (DA), Work Methods (WM), 
Teacher Approval (TA) and Education Acceptance (EA). A Study 
Habits (SH) score is then obtained by combining scores on DA and 
WM scales and a Study Attitudes (SA) score is obtained by combining 
scores on TA and EA scales. Finally, a total score of Study 
Orientation (SO) is obtained by further combining the SH and SA scores. The 
combination of scales was justified on the basis of intercorrelations 
among the four basic scales for a sample of 3,054 college freshmen 
(Brown and Holtzman, 1966). For instance, scores on DA and WM 
scales correlated .70 with each other and scores on TA and EA scales 
correlated .69 with each other. Other correlations were: DA vs TA = 
.49, DA vs EA = .65, WM vs TA = .53, WM vs EA = .62. 
Although not explicitly stated, the authors seem to have put forward 
the notion of a hierarchy of academic attitudes and study habits. This 
hierarchy would have the SO factor at the top of the pyramid, the SH 
and SA factors as the second tier, and the DA, WM, TA and EA scales as 
the branches forming the third tier. It was the purpose of the present 
study to empirically assess the validity of the a priori scales and to test 
the hypothesis of a hierarchical structure of attitudes and study habits. 


Method 


The sample for the study came from senior classes in four high 
schools (N = 243) and from freshman classes in two universities (N = 
603) in Ontario. Each respondent was asked to indicate whether a 
statement on the SSHA was rarely (0 to 15% of the time), sometimes 
(16 to 35% of the time), frequently (36 to 65% of the time), generally 
(66 to 85% of the time), or almost always (86 to 100% of the time) 
true for him. The five options of rarely to almost always were assigned 
numerical values from 1 to 5 respectively. Data for the school and 
university samples were combined because the SSHA, Form C is 
intended for both high school seniors and college freshmen. 
Pearson product-moment correlations were obtained among the 
items by assuming that the response scale was continuous and that 
responses to each item were normally distributed in the population 
from which the samples were drawn. To determine whether 
covariation among responses can be explained by the four a priori 
scales, the inter-item correlation matrix was analyzed by the method of 
principal components. Four components associated with the first four 
eigenvalues were transformed to simple structure by the normalized 
varimax procedure and the resulting transformed solution was 
psychologically interpreted by noting the proportion of items which had 
loadings higher than the critical value (arbitrarily selected as .35) on 
the appropriate factors. 
The same question was examined in another way. A factor matrix 
was constructed by placing 1's, -1's and 0's as elements of the matrix 
according to the classification of items in each scale and whether an 
item was positively or negatively worded. The observed factor matrix 
was transformed orthogonally to similarity to the constructed factor 
matrix by a procedure from Cliff (1966). Coefficients of congruence 
(φ) between the corresponding factors of the constructed and the 
observed factor matrices were obtained in order to determine the 
extent to which a priori classification of items into scales was confirmed 
by the data. 
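
The coefficient of congruence between a constructed and an observed factor can be sketched as follows. This is an illustrative sketch under stated assumptions: the eight-item loadings and the choice of negatively worded item are hypothetical, the function names are invented for the example, and the orthogonal transformation from Cliff (1966) that precedes this step in the study is not reproduced.

```python
import math

def congruence(observed, target):
    """Coefficient of congruence between two columns of factor loadings."""
    cross = sum(o * t for o, t in zip(observed, target))
    norm = math.sqrt(sum(o * o for o in observed) * sum(t * t for t in target))
    return cross / norm

def target_column(n_items, scale_index, n_scales=4, negative_items=()):
    """One column of the constructed matrix of 1's, -1's and 0's.

    Items are assigned to scales in rotation (item 1 to scale 0, item 2 to
    scale 1, and so on); negatively worded items receive -1.
    """
    column = []
    for item in range(1, n_items + 1):
        if (item - 1) % n_scales != scale_index:
            column.append(0)
        elif item in negative_items:
            column.append(-1)
        else:
            column.append(1)
    return column

# Hypothetical example: eight items, first scale, item 5 negatively worded.
target = target_column(8, scale_index=0, negative_items={5})
observed = [0.6, 0.1, 0.0, 0.2, -0.5, 0.1, 0.0, 0.1]  # hypothetical loadings
phi = congruence(observed, target)
print(target, round(phi, 3))
```

The coefficient is 1 when the two columns are proportional and falls toward 0 as the observed loadings depart from the hypothesized pattern.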

To test the hypothesis of hierarchical structure, higher-order factors 
were derived by factoring the correlations among the lower-order fac- 
tors. In order to facilitate psychological interpretation, the final matrix 
consisting of the loadings of the variables on the initial and higher-
order factors was transformed to an orthogonal position by using the 
procedure described by Schmid and Leiman (1957). 
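
For a solution with a single second-order factor, the orthogonalization can be sketched as below. This is a simplified illustration of the usual two-level form of the Schmid and Leiman transformation, not the authors' computation: the item loadings are hypothetical, while the second-order loadings are those reported in Table 2 with decimal points restored.

```python
import math

def schmid_leiman(first_order, second_order):
    """Orthogonalize a two-level solution with one second-order factor.

    first_order  -- rows of item loadings on the oblique first-order factors
    second_order -- loadings of those factors on the single general factor
    Returns (general loadings per item, residualized group loadings per item).
    """
    # Part of each first-order factor that is independent of the general factor.
    residual = [math.sqrt(1.0 - g * g) for g in second_order]
    general = [sum(a * g for a, g in zip(row, second_order)) for row in first_order]
    group = [[a * r for a, r in zip(row, residual)] for row in first_order]
    return general, group

# Second-order loadings as read from Table 2 (decimal points restored).
second_order = [0.78, 0.52, 0.73, -0.42]
# Hypothetical first-order pattern for two items.
first_order = [
    [0.60, 0.00, 0.00, 0.00],
    [0.00, 0.50, 0.00, 0.00],
]
general, group = schmid_leiman(first_order, second_order)
print([round(g, 3) for g in general])
print([[round(x, 3) for x in row] for row in group])
```

For an item that loads on one first-order factor only, the squared general and group loadings sum to the squared original loading, so the variance is redistributed rather than changed.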


Results 


The results of the varimax and Schmid-Leiman transformations of 
the initial factors of SSHA appear in Table 1. The factors under 
Arabic numbers are the varimax-transformed while those under 
Roman numerals resulted from the Schmid-Leiman transformation of 
the second- and first-order factors. The factors have been arranged 
such that they correspond to the a priori scales. The description of the 
varimax results follows immediately while that of the Schmid-Leiman 
transformation is included in the discussion of the hypothesis 
regarding the hierarchical structure of attitudes and study habits. 

According to how the SSHA is constructed in terms of order of items 
and the corresponding scales they refer to, it is expected that the first 
and every fourth item thereafter should have higher loadings (.35 or 
higher) on the first factor (DA), the second and every fourth item 
thereafter should have higher loadings on the second factor (WM), the 
third and every fourth item thereafter should have higher loadings on 
the third factor (TA), and the fourth and every fourth item thereafter 
should have higher loadings on the fourth factor (EA). An examination 
of the loadings on the first factor indicates that 
16 out of 25 (64%) items satisfy the above criterion. There are 17 out of 
25 (68%) items which load .35 or higher on the second factor. For the 
third factor, 20 out of 25 (80%) items meet the criterion of having .35 
or higher loadings. The proportion of appropriate loadings on the 
fourth factor is negligible (12%) compared to the other three factors. 

Interpretation of a factor is facilitated if the proportion of 
inappropriate loadings on the factor is relatively small. Theoretically, there 
could have been as many as 75 inappropriate loadings (.35 or higher) 
on each factor. A count of the inappropriate loadings yields 6.66% on 
the first factor, 1.33% on the second factor, 10.66% on the third factor, 
and 26.66% on the fourth factor. 
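
The counting rule can be sketched as follows. The loadings for the eight items are hypothetical; the assignment of items to scales in rotation and the .35 criterion follow the description in the text, with the scale order DA, WM, TA, EA taken from the order in which the scales are listed.

```python
SCALES = ("DA", "WM", "TA", "EA")
CRITICAL = 0.35

def scale_of(item_number):
    """Items 1, 5, 9, ... belong to DA; 2, 6, 10, ... to WM; and so on."""
    return SCALES[(item_number - 1) % len(SCALES)]

def count_loadings(loadings):
    """Count appropriate and inappropriate salient loadings per factor.

    `loadings` maps item number -> loadings on the four factors, with the
    factors arranged to correspond to SCALES.
    """
    appropriate = [0] * len(SCALES)
    inappropriate = [0] * len(SCALES)
    for item, row in loadings.items():
        for factor, value in enumerate(row):
            if abs(value) < CRITICAL:
                continue  # loading below the criterion is ignored
            if SCALES[factor] == scale_of(item):
                appropriate[factor] += 1
            else:
                inappropriate[factor] += 1
    return appropriate, inappropriate

# Hypothetical loadings for the first eight items.
hypothetical = {
    1: [0.46, 0.10, 0.00, 0.00],
    2: [0.00, 0.50, 0.00, 0.00],
    3: [0.40, 0.00, 0.50, 0.00],
    4: [0.00, 0.00, 0.40, 0.00],
    5: [0.43, 0.00, 0.00, 0.00],
    6: [0.00, 0.66, 0.00, 0.00],
    7: [0.00, 0.00, -0.44, 0.36],
    8: [0.00, 0.00, 0.00, 0.10],
}
appropriate, inappropriate = count_loadings(hypothetical)
print(appropriate, inappropriate)
```

With 100 items and four scales, each factor has 25 possible appropriate loadings and 75 possible inappropriate loadings, which are the bases of the percentages reported above.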
TABLE 1 
Varimax and Schmid-Leiman Transformation of SSHA Factors 
Factors 
ITEMS P {б Н 2 I 3 Iv 4 
1 46 45 
2 50 
3 40 6 50 
4 40 
5 43 
6 6 66 
7. -4 —44 36 
8 
9 45 5217). 135 
10 41 41 —41 
П 4 54 38 
12. -4 -4 52 
13 -36 -3 49 
14 62 
15 35 
16 48 
17 47 
18 43 36 
44 43 LOL 
21 37 
22 
23 48 TEMAS 
24 48 %6 
5 39 -35 
6 55. 47 
27 42 50 
28:1 SO 41 
29 39 
В), 56 57 
50 
32 48 35 46 -38 
3 35 47 
53 48 
КЫШЫ 2 35 
37 53 9 4 
38 58 5 
9 на 70 56 
40 53 36 42 
disque 
35 
43 1 35 
44 38 59 44 
45 44 55.3] 36 
Ва camera 
47 35 
48 35 45 
49 58 59 38 49 37 
EL 65 59 
$2: 029g mou 
53 50 65 
54 M 


49 


50 
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i b. 55 35 

56 

57227567 5260721438 Е 
58 (ii 
59 39 26. d] 

COE MO Eae at 

61 37 520210397 

62 Е 56 52 

(pes ` 45, 

64 40 E 

65 i 
66 37 60 

67 , 

68 

69 

% 43 45 50 
11 38 Ex 

72 
73 —35 a 
74 461,79 

75 

76 

77 

ds 53 
79 

80 

81 54 63 42 

82 46 46 

83 36 Pians 

84 

85 -9 -4 n 
86 40 38 

87 

88 37 атлар 51 
$ C = 

= 47 53 43 
91 35 6 

92 

93 

94 38 46 

95 37 B 

96 49 4 

97  —38 i a 
98 48 39 

99 —38 
100 ita 43 
Factor       1      2      3      4 
App.        16     17     20      3 
% App.      64     68     80     12 
Inapp.       5      1      8     20 
% Inapp.  6.66   1.33  10.66  26.66 
φ           54     71     66     28 


* Decimal points omitted and loadings less than .35 not reported. 
Transformed varimax factors. 
Transformed Schmid-Leiman factors. 
The analysis transforming the observed factor matrix to the 
constructed matrix yielded coefficients of congruence of 
.54, .71, .66, and .28 for the four factors. The results of the simple 
counting of appropriate and inappropriate factor loadings are 
substantiated by the magnitude of the coefficients of congruence. The 
sampling distribution of the coefficient of congruence is not known, 
which makes it difficult to determine whether the above values are 
statistically significant. Evans (1970) suggests that coefficients of .90 or 
higher indicate "good" correspondence, coefficients from .80 to .90 
show "fair" correspondence, and coefficients from .70 to .80 indicate 
"poor" correspondence between a pair of factors. According to this 
rule of thumb, there seems to be poor correspondence at best between 
the hypothesized and observed factors; however, this rule may be too 
stringent to apply to an item factor matrix because correlations among 
single items may not be expected to be as high as correlations among 
tests or subtests containing a reasonable number of items. Keeping 
this in mind, the results of similarity analysis may be interpreted to 
mean that the data generally support the a priori classification of items 
for three out of the four scales. 

The results of the factoring of the four first-order factors do not con- 
firm the type of hierarchical structure proposed for study habits and 
academic attitudes. The group factors of SH and SA did not emerge; 
instead, a general factor (Study Orientation) resulted from the analysis 
of inter-relationships among the four factors. The present results 
support a two-stage rather than a three-stage hierarchy of academic 
attitudes and study habits as measured by SSHA. 

The general factor is mainly composed of items from the DA and 
TA scales. The WM and the fourth scale seem to be specific and have 
not as much in common with the general factor as the other two scales. 
This observation is reflected in the intercorrelations and the factor 
loadings of the first-order factors on the general factor presented in 
Table 2. For instance, the general factor explains 61% and 53% of the 
variance in the DA and TA scores respectively compared to 27% and 
18% of the variance that is common to the general factor and WM 
and the fourth factor respectively. 
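
These percentages follow from squaring the loadings of the first-order factors on the general factor. The short check below uses the loadings as read from Table 2, with the omitted decimal points restored; the dictionary and variable names are illustrative.

```python
# Loadings of the first-order factors on the second-order (general) factor,
# as read from Table 2 with decimal points restored.
general_loadings = {
    "Delay Avoidance": 0.78,
    "Work Methods": 0.52,
    "Teacher Approval": 0.73,
    "Academic Diligence": -0.42,
}

# Variance shared with the general factor is the squared loading.
shared_percent = {
    name: round(100 * loading ** 2) for name, loading in general_loadings.items()
}
for name, percent in shared_percent.items():
    print(f"{name}: {percent}% of variance shared with the general factor")
```

The sign of a loading does not affect the shared variance, which is why the fourth factor, although negatively loaded, still shares 18% of its variance with the general factor.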


Discussion 


The results tend to suggest that the a priori classification of items 
holds for the DA, WM, and TA scales but not for the EA scale. An analysis of the 
content of items which load on factor 4 reveals a tendency to apply 
one's self seriously to systematic studying and to doing assignments. 
These items were judged to belong in the DA and WM categories. It 
seems that these items tap motivational characteristics rather than 
modes of studying and factor 4 may be interpreted as a measure of 
"academic diligence." 


TABLE 2 
Intercorrelations among First-Order Factors and Their 
Loadings on the Second-Order Factor(a) 

First-Order Factors          1     2     3     4    I(b) 
1 Delay Avoidance                 41    52   -41    78 
2 Work Methods                          44   -10    52 
3 Teacher Approval                           -31    73 
4 Academic Diligence(c)                            -42 

(a) Decimal points omitted. 
(b) Second-order factor. 
(c) Renamed. 

Twelve of the items, which were supposed to have loaded on an EA 
factor, did not load on any of the four factors. In an earlier study 
(Khan, 1969), very similar items were hypothesized to measure an 
"attitude toward education" factor for Junior High School students. 
Such a factor did not emerge in separate analyses for both males and 
females. Most of the remaining items classified in the EA scale have 
appeared on the TA factor with a few items loading on the DA factor. 
A scrutiny of these items indicates that there is an implicit or explicit 
reference to the teacher, and the respondents have interpreted these 
statements in relation to the teacher as the stimulus object. 

The intercorrelations among the subscales reported by Brown and 
Holtzman (1966) do not seem to justify a three-stage hierarchy of 
school-related attitudes and study habits as measured by SSHA. The 
correlations between the subscales making second-order SH and SA 
scales are large enough to indicate the presence of a general factor. The 
present findings confirm such an expectation. 

Brown and Holtzman (1966) have emphasized the value of the four 
subscales in diagnosis and counselling by an analysis of an individual's 
responses to statements in each subscale. The results of the present 
study have not supported the existence of an "Education Acceptance" 
scale and further work on the items making up this scale may be 
necessary before it is recommended for use in the evaluations of students' 
attitudes toward education and counselling based upon these 
evaluations. The present results also indicated that the "Study Habits" and 
"Study Attitudes" scales do not follow from the basic subscales. 
Although a counsellor may be interested in knowing whether a student 
needs help in the area of attitudes or study skills, it is doubtful if this 
information is provided by adding scores on the "appropriate" 
subscales of SSHA. It is suggested that scores on the EA subscale and on 
the SH and SA scales be interpreted with caution because the validity 
of these subcomponents of SSHA is questionable. 
A MEASURE OF RELIABILITY USING 
QUALITATIVE DATA 


MENI KOSLOWSKY AND HOWARD BAILIT 


Department of Behavioral Sciences and Community Health 
University of Connecticut Health Center 
Farmington, Connecticut 


In many research activities, the data are unordered or qualitative. In 
such circumstances, inter-rater reliability is usually measured by 
calculating a percentage of agreement score between judges. The pre- 
sent paper expands on an equation first introduced by Goodman and 
Kruskal for obtaining a reliability measure of one item. This formula 
determines inter-rater reliability for a series of items across many sub- 
jects. The statistic that results is easily interpreted and in many ways 
is analogous to the conventional reliability for quantitative data. 


IN many types of research activities, it is necessary to obtain a 
reliability measure for qualitative or unordered data. The procedures 
that are presently available cannot handle such data using the classical 
reliability measures. Finn's (1970) method assumes interval type data, 
and Goodman and Kruskal's (1954) formula for handling reliability of 
unordered data is good for only one item at a time. This paper expands 
on the latter's formulation and discusses an approach for calculating 
the reliability of a series of items. In this way, the procedure is 
analogous to the usual reliability determination for an achievement 
test or an attitude scale. 


Method 


The Goodman and Kruskal formula states that a measure of 
reliability in the unordered case is: 

$$\lambda_r = \frac{\sum_a P_{aa} - \tfrac{1}{2}(P_{M\cdot} + P_{\cdot M})}{1 - \tfrac{1}{2}(P_{M\cdot} + P_{\cdot M})} \qquad (1)$$

where $\sum_a P_{aa}$ represents the proportion of elements along the diagonal 
of a contingency table, and $P_{M\cdot}$ and $P_{\cdot M}$ represent the marginal 
proportions associated with the modal category for rows and columns, 
respectively. 

Thus, if two judges classified 100 people into the following con- 
tingency table according to their neurotic symptoms, the conventional 
reliability formulas would not be applicable (see Table 1). 

It is obvious that the three classifications of neuroses do not form an 
ordered scale. For this type of data, $\lambda_r$ provides useful insight into 
gauging inter-rater reliability. 

Goodman and Kruskal say the formula may be interpreted "as the 
relative decrease in error probability as we go from the no information 
situation to the other-method-known situation" (p. 758). In our case, 
as $\lambda_r$ increases the probability of judge II making the same assignment 
as judge I increases. This measure yields a statistic with much more 
information than the usually reported "proportion of agreement." 
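
The single-item measure can be sketched in a few lines. The contingency table below is hypothetical (it is not the article's Table 1, although its marginal totals match the modal proportions of .40 used there), and the function name is invented for the example. Following the wording above, the modal proportion is taken separately from the row and column marginals.

```python
def lambda_r(table):
    """Goodman-Kruskal reliability for a square table of counts.

    Rows are one judge's categories and columns the other judge's.
    """
    total = sum(sum(row) for row in table)
    # Proportion of cases on the diagonal (both judges agree).
    agreement = sum(table[i][i] for i in range(len(table))) / total
    # Largest marginal proportion for rows and for columns.
    modal_row = max(sum(row) for row in table) / total
    modal_col = max(sum(col) for col in zip(*table)) / total
    chance = 0.5 * (modal_row + modal_col)
    return (agreement - chance) / (1.0 - chance)

# Hypothetical classification of 100 people into three unordered categories.
hypothetical_table = [
    [20, 10, 0],
    [10, 25, 5],
    [5, 5, 20],
]
value = lambda_r(hypothetical_table)
print(round(value, 3))
```

Here the judges agree on 65% of the cases, while guessing the modal category would be right 40% of the time, so the error probability falls from .60 to .35, a relative decrease of about .42.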


The Case of Several Items 


It is possible to extend the Goodman-Kruskal formula to include a 
series of items. Thus, in the previous example, two judges may be 
required to assign a group of individuals to one of three categories based 
on a series of independent skills or abilities. $\lambda'_r$ would then represent 
an average of individual $\lambda_r$'s across N items: 

$$\lambda'_r = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_a P_{aa} - \tfrac{1}{2}(P_{M\cdot} + P_{\cdot M})}{1 - \tfrac{1}{2}(P_{M\cdot} + P_{\cdot M})} \qquad (2)$$

TABLE 1 
The Use of the Goodman-Kruskal Reliability Formula 
ee 
Judge I 
Obsessive ^ Hysterical Phobic Total 
Obese ия 
sive 20 30 
Judge Hysterical 10 5 з 40 
1i Phobic 5 5 20 30 
Total 35 40 25 
ars A040) 


1- %(.40 + .40) 
С 40+ 40) — 9 
The disadvantage of formula (2) is that $\lambda'_r$ becomes indeterminate 
when both judges agree and assign all subjects to the same category 
for one or even several items. However, since we are concerned with 
similarity in judges' ratings, and not item discriminability, a score of 
"1" can be assigned in this case. An example will clarify the use of the 
formula. 

In a recently completed investigation of a new procedure to evaluate 
the quality of dental care that patients receive, a 3-point scale 1, 3, and 
9 was devised. The value “1” was assigned by a judge if he felt the den- 
tist had failed the requirements of a certain criterion, 3 was assigned if 
the dentist had passed the requirements of a certain criterion, and 9 
was assigned if the judge could not make a decision. The usual 
reliability formulae are inappropriate because “9” in this data has no 
ordinal relationship to 1 and 3. All patients were seen by two judges 
and their dental treatment was rated on a total of 29 criteria,* each 
involving an independent technical skill. 

Table 2 presents results from this investigation. Along the rows are 
the item numbers and the columns contain, for each item, the 
proportion of agreement, the modal proportion for Judge I and the modal 
proportion for Judge II. The calculated $\lambda'_r$ equals .27. This indicates 
that the probability of an error associated with just guessing the modal 
class (chance level) for each item has been decreased by 27%. Thus, if 
one guessed Judge II's response by following Judge I's classification, 
he would be better off than guessing the modal category for 
Judge II. 

Discussion and Implication 


Many of the limitations associated with $\lambda_r$ are also associated with 
$\lambda'_r$. (For a full discussion of these see the Goodman and Kruskal 
paper.) As is evidenced from Table 2, the restriction on the 
denominator for an individual item does not invalidate the procedure. 
When the degree of comparability between judges is being assessed, 
perfect agreement is the optimal result. 

The meaningfulness of a descriptive measure such as $\lambda'_r$ is quite 
apparent. The information obtained from this statistic can be used in 
the same way that decisions are made after a conventional reliability 
determination. As $\lambda'_r$ increases and approaches unity, more and 
more confidence can be placed in the present classification scheme. 
However, as $\lambda'_r$ decreases, the assertion of similarity between the 
two judges' ratings becomes more and more untenable and puts the 
use of the classification scheme into question. 


* The description presented here has been abbreviated in order to present only those 
aspects relevant to this article. 


TABLE 2 

An Example of $\lambda'_r$ as an Indicator of Reliability of Several Items 

Item   $\sum P_{aa}$   $P_{M\cdot}$   $P_{\cdot M}$ 
  1      .846     .923     .923 
  2      .923     .846     .923 
  3      .769     .846     .615 
  4     1.000    1.000    1.000 
  5     1.000    1.000    1.000 
  6     1.000    1.000    1.000 
  7      .923    1.000     .923 
  8      .923     .846     .923 
  9      .846     .923     .769 
 10      .846    1.000     .846 
 11     1.000    1.000    1.000 
 12     1.000    1.000    1.000 
 13      .769     .846     .923 
 14     1.000     .769     .769 
 15      .923     .769     .692 
 16     1.000    1.000    1.000 
 17      .923    1.000     .923 
 18      .846     .846     .769 
 19      .692     .769     .769 
 20     1.000    1.000    1.000 
 21     1.000    1.000    1.000 
 22     1.000     .923     .923 
 23      .846     .923     .923 
 24      .923    1.000     .923 
 25      .923     .769     .846 
 26     1.000     .769     .769 
 27     1.000     .538     .461 
 28     1.000     .769     .769 
 29      .769     .692     .923 
RELIABLE DIMENSIONS FOR WISC PROFILES 


ANTHONY J. CONGER* 
University of North Carolina 


JUDITH COHEN CONGER 
Duke University 


Measures of multivariate reliability are calculated for profiles of 
WISC subscales on three age groups. Profile dimensions based on 
reliability considerations are also established and matched across age 
groups and with factor analytic dimensions. While all possible 
differences among individual subscales are quite unreliable (about 
.51), a reduced set of five uncorrelated dimensions can be found with 
a more satisfactory reliability (about .87). In unrotated form, the 
maximally reliable dimension is essentially total IQ and the second 
maximally reliable dimension closely resembles a verbal-
performance contrast. Four of the five rotated dimensions give a 
good match across age groups and with Verbal Comprehension, 
Relevancy, Perceptual Organization and Maze-specific factors. 
Guidelines for the interpretation and use of WISC subscale profiles 
are provided for both clinical and research uses. 


THE use of intelligence tests for other than general intellectual 
diagnoses has been criticized in recent years (Anastasi, 1968); 
undaunted, clinicians have continued to use various combinations of 
intelligence test subscales to estimate differential abilities (Gainer, 1965), 
diagnose specific deficiencies (Kallos, Grabow, and Guarino, 1961) 
and to assess various psychological traits such as anxiety and 
neuroticism (Rashkis and Welsh, 1964). This is partly due, one 
suspects, to the "clinical tradition" as well as the availability of 
numerous subscale scores on instruments such as the Wechsler scales. 
Some of the diagnostic applications are based on criterion group 
comparisons while others are based on factor analyses (e.g., Lutey, 1966). 
Based on their own literature survey, as well as relying heavily on the 
work of Lutey (1966), Robb, Bernardoni and Johnson (1972) present 
a broad comprehensive approach to the diagnostic use of WISC and 
WAIS subscale profiles which may become the "standard" for 
diagnostic applications; however, in order to make valid 
discriminations among individuals (or groups) one must first confront the 
problem of profile reliability. Unfortunately, Robb et al. have not 
considered the overall reliability of the WISC subscales, nor the reliability 
of the differences of the separate profile dimensions which underlie 
their approach. In fact, most uses as well as users of profiles have 
ignored the issue of the reliability of profile differences (perhaps with 
good reason, in that techniques for the evaluation of profile reliability 
have not been generally available). 

Lutey (1966) and Robb et al., (1972) derived WISC and WAIS 
profiles based on various factor analyses of the subscales of these in- 
struments. Their method of forming profiles presents problems that 
are not readily resolvable. First, the scores that are formed are fre-
quently overlapping (but are not weighted so as to be independent).
This results not only in highly correlated factor scores but also
in correlated errors of measurement. For example, for the 7½ year age
group, one factor score is formed from the sum of the WISC Informa-
tion, Arithmetic, and Vocabulary subscales, whereas a second factor
score is formed from the sum of the Information, Arithmetic, Vocabulary
and Comprehension subscales. The only valid difference between these 
two factor scores would be due to Comprehension! A second problem 
is that because of the large number of factor scores provided by the 
Robb et al. approach (9 common factors), strong dependencies exist 
among the overall set of factor scores. For example, for the 13½ year
age group, their "G" factor score can be derived from the sum of their 
verbal comprehension factor and anxiety (or numerical) factor minus 
their fluency factor (G = VC + A:N - F). Hence, although Robb et
al. obtain nine separate factor scores for this age group, there are at
most eight independent scores. One consequence of these aforemen-
tioned problems is that all possible differences among their factor
scores might be less reliable than desired. More deleterious, however, is
that because of linear dependencies and correlated errors of measure-
ment, the actual reliability of their scores cannot be unambiguously
ascertained. 

Based on recent statistical developments (Conger and Lipshitz, 
1971, 1973; Conger, 1974) a general measure of profile reliability is 
available and can be used to establish the reliability of all possible 
profile (or subscale) comparisons and can also be used to establish a 
set of independent and maximally reliable profile dimensions. This 
technique is not unlike the approach discussed by Bock (1966), and its 
application to instruments such as the WISC is recommended by Cron-
bach, Gleser, Nanda, and Rajaratnam (1972). A measure of overall
profile reliability is useful for establishing the level of confidence one 
can have in stating that any observed difference between the profiles
of two individuals is a reliable difference. More importantly, however, 
the most reliable subscale composites would probably provide a better 
basis than factor analysis for general diagnostic applications because 
the former approach eliminates unreliable information but retains 
reliable specific factor information. For the discrimination of clinical 
subgroups actual validity studies are, of course, requisite; however, to 
the degree that validity is limited by reliability, the most reliable com- 
posites would again provide a better basis for such studies. The pur- 
pose of this paper is to determine maximally reliable profile dimen- 
sions for the WISC which could be used in various validity studies, 
and to provide a measure of WISC profile reliability that can serve as a
guide for the clinical interpretation of profile differences. 


Method of Analysis 


Overall profile reliability can be determined for either of two general
approaches to profiles. One method for handling profile differences is
Cronbach's D² and another, that is generally accepted, is
Mahalonobis’ D². The former profile differences are easier to calculate
but are less tractable, in a statistical sense, than the latter. If we let $X_i$
represent a vector of scores for person i and $X_j$ represent a vector of
scores for person j (or a "target" profile), the Cronbach distance
measure is

$$D_{ij}^{2} = (X_i - X_j)'(X_i - X_j)$$

or, in terms of the individual subscales,

$$D_{ij}^{2} = \sum_{k=1}^{K} (X_{ik} - X_{jk})^{2}$$

where $X_{ik}$ is the score of person i on subscale k (there are K subscales).
The Mahalonobis distance measure for the same scores is found from:

$$D_{ij}^{2} = (X_i - X_j)'\,\Sigma_{xx}^{-1}\,(X_i - X_j)$$

where $\Sigma_{xx}^{-1}$ is the inverse of the covariance matrix for the entire set
of individuals. In terms of subscale scores,

$$D_{ij}^{2} = \sum_{k=1}^{K}\sum_{k^{*}=1}^{K} (X_{ik} - X_{jk})(X_{ik^{*}} - X_{jk^{*}})\,\sigma^{kk^{*}}$$

where $\sigma^{kk^{*}}$ is the kk*th element in $\Sigma_{xx}^{-1}$. If standardized scores are
used, a correlation matrix is employed instead of a covariance matrix,
but the same D² value will be obtained.
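The two distance measures can be sketched in a few lines of Python. This is only an illustration: the function names and the simulated scores are assumptions made for the example, not WISC data. It also checks numerically that standardized scores with a correlation matrix give the same Mahalonobis D² as raw scores with a covariance matrix.

```python
import numpy as np

def cronbach_d2(x_i, x_j):
    """Cronbach D^2: squared Euclidean distance between two profiles."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ diff)

def mahalanobis_d2(x_i, x_j, cov):
    """Mahalonobis D^2, weighting differences by the inverse of the
    group covariance (or correlation) matrix."""
    diff = np.asarray(x_i, dtype=float) - np.asarray(x_j, dtype=float)
    return float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff))

# Hypothetical data: 50 persons by 3 correlated subscales (not WISC data).
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 1.0, 0.5],
                   [0.0, 2.0, 1.0],
                   [0.0, 0.0, 1.5]])
scores = rng.normal(size=(50, 3)) @ mixing + 10.0

cov = np.cov(scores, rowvar=False)
corr = np.corrcoef(scores, rowvar=False)
z = (scores - scores.mean(axis=0)) / scores.std(axis=0, ddof=1)

d2_raw = mahalanobis_d2(scores[0], scores[1], cov)   # raw scores, covariance matrix
d2_std = mahalanobis_d2(z[0], z[1], corr)            # standardized scores, correlation matrix
d2_cronbach = cronbach_d2(scores[0], scores[1])
print(d2_raw, d2_std, d2_cronbach)
```

When the subscales are uncorrelated with unit variance, the covariance matrix is the identity and the two measures coincide.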
Profile reliability for the Cronbach distances is simply the average of
the univariate reliabilities (Conger and Lipshitz, in press):

$$\rho_{C}^{2} = \frac{1}{K}\sum_{k=1}^{K}\rho_{k}^{2} \qquad (1)$$

where $\rho_{k}^{2}$ is the univariate reliability for subscale k. Profile reliability
for the Mahalonobis distances is found from

$$\rho_{D}^{2} = \frac{1}{K}\,\mathrm{Trace}\,(R_{xx}^{*}R_{xx}^{-1}) \qquad (2)$$

where $R_{xx}^{-1}$ is the inverse of the correlation matrix among the sub-
scales and $R_{xx}^{*}$ is the correlation matrix $R_{xx}$ with univariate
reliabilities $\rho_{k}^{2}$ substituted into the diagonal (Conger and Lipshitz, 1971;

Although Conger and Lipshitz (1973) have shown that the Cron- 
bach distances are always more reliable than Mahalonobis distances 
(unless all variables are uncorrelated, in which case both reliabilities 
are equal), they caution that the choice of these profile distance 
measures depends on the desired use. The Cronbach distances
emphasize common factors of the subscales whereas the Mahalonobis 
distances allow more weight for independent contribution, but 
simultaneously allow more weight for unreliable subscales. The real 
strength of the Mahalonobis approach is that a reduced set of dimen- 
sions (or composites) can be found which are uncorrelated with one 
another and which have maximum reliability. 
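The two overall reliabilities can be computed directly from a correlation matrix and a vector of univariate reliabilities. The sketch below assumes the Cronbach measure is the mean of the univariate reliabilities and the Mahalonobis measure is Trace(R*R⁻¹)/K, as described above. The function names, the three-subscale correlation matrix, and the reliabilities are hypothetical values chosen for illustration.

```python
import numpy as np

def cronbach_profile_reliability(reliabilities):
    """Mean of the univariate reliabilities."""
    return float(np.mean(reliabilities))

def mahalanobis_profile_reliability(R, reliabilities):
    """Trace(R* R^-1) / K, where R* is R with the reliabilities on its diagonal."""
    R = np.asarray(R, dtype=float)
    R_star = R.copy()
    np.fill_diagonal(R_star, reliabilities)
    return float(np.trace(R_star @ np.linalg.inv(R)) / R.shape[0])

# Hypothetical three-subscale example (not WISC values).
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
rel = np.array([0.80, 0.70, 0.90])

rho_c = cronbach_profile_reliability(rel)
rho_d = mahalanobis_profile_reliability(R, rel)
print(round(rho_c, 3), round(rho_d, 3))
```

In this example the Mahalonobis value is the lower of the two, in line with the result attributed to Conger and Lipshitz (1973); with an identity correlation matrix the two values are equal.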

These maximally reliable and uncorrelated profile dimensions are 
found by solving the eigenvalue-eigenvector equation 


$$R_{xx}^{*}V_j = \gamma_j^{2}\,R_{xx}V_j \qquad (3)$$

where $V_j$ is termed a canonical vector and provides the weights for the
jth dimension and $\gamma_j^{2}$ is a canonical root and equals the reliability of
the jth dimension. The canonical dimension is simply a weighted com-
bination of the initial scores; that is, a new score $Y_{ij}$ is formed for each
individual i using each canonical vector $V_j$ as follows:

$$Y_{ij} = \sum_{k=1}^{K} X_{ik}\,v_{kj}$$

In fact, and this will be used below, any composite (or dimension)
score is simply a weighted combination of the observed scores. Thus, 
for example, to form the standard verbal-performance contrast, the 
five verbal subscales are weighted “+1” and the five performance sub- 
scales are weighted "—1"; any other scale (e.g., Digit Span) is 
weighted “0.” The verbal-performance contrast is thus: 


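$$Y_{i(VP)} = \sum_{k=1}^{K} X_{ik}\,w_{k(VP)}$$

where

$$w_{k(VP)} = \begin{cases} +1 & \text{if subscale } k \text{ is a verbal subscale,} \\ -1 & \text{if subscale } k \text{ is a performance subscale,} \\ \phantom{+}0 & \text{if subscale } k \text{ is neither.} \end{cases}$$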


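The reliability of any composite scale can be found by:

$$\rho_{w}^{2} = \frac{\sum_{k=1}^{K}\sum_{k^{*}=1}^{K} w_{k}w_{k^{*}}\,r_{kk^{*}}^{*}}{\sum_{k=1}^{K}\sum_{k^{*}=1}^{K} w_{k}w_{k^{*}}\,r_{kk^{*}}} \qquad (4)$$

where $r_{kk^{*}}^{*}$ and $r_{kk^{*}}$ are the elements of $R_{xx}^{*}$ and $R_{xx}$, respectively.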
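The composite reliability can be illustrated with a small sketch, assuming it takes the ratio form w'R*w / w'Rw, where R* is the correlation matrix with the univariate reliabilities on its diagonal. The four-subscale matrix (two "verbal", two "performance" scales) and the reliabilities are hypothetical, as is the function name.

```python
import numpy as np

def composite_reliability(weights, R, reliabilities):
    """Reliability of a weighted composite: w'R*w / w'Rw."""
    w = np.asarray(weights, dtype=float)
    R = np.asarray(R, dtype=float)
    R_star = R.copy()
    np.fill_diagonal(R_star, reliabilities)   # true-score matrix R*
    return float((w @ R_star @ w) / (w @ R @ w))

# Hypothetical example: subscales 0-1 are "verbal", 2-3 are "performance".
R = np.array([[1.0, 0.6, 0.4, 0.4],
              [0.6, 1.0, 0.4, 0.4],
              [0.4, 0.4, 1.0, 0.5],
              [0.4, 0.4, 0.5, 1.0]])
rel = np.array([0.85, 0.80, 0.75, 0.70])

rel_total = composite_reliability([1, 1, 1, 1], R, rel)       # simple sum of all scales
rel_contrast = composite_reliability([1, 1, -1, -1], R, rel)  # verbal minus performance
print(round(rel_total, 3), round(rel_contrast, 3))
```

With positively correlated subscales the sum is more reliable than the contrast, because the shared variance adds to the true-score variance of a sum but cancels in a difference.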


The canonical profile dimensions have the additional characteristic
that if the $\gamma_j^{2}$ are ordered according to magnitude (from larger to
smaller), $Y_{i1}$ is the most reliable composite score possible, $Y_{i2}$ is the
most reliable composite score that is uncorrelated with $Y_{i1}$, $Y_{i3}$ is the
most reliable composite score uncorrelated with both $Y_{i1}$ and $Y_{i2}$, etc.
Using these facts, a profile user can utilize whatever number of com-
posites deemed by him to have sufficient reliability. Statistical tests for
the reliabilities are also available if a test-retest or parallel form ap-
proach is adopted (see Bock, 1966 or Conger, 1974). If only the first
m dimensions are used, then the overall reliability for these dimen-
sions is

$$\rho_{D_m}^{2} = \frac{1}{m}\sum_{j=1}^{m}\gamma_j^{2} \qquad (5)$$

If all canonical composites are retained, then

$$\rho_{D_K}^{2} = \rho_{D}^{2}.$$
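One way to obtain the canonical roots and vectors is to reduce the generalized eigenproblem R*V = γ²RV to an ordinary symmetric one through a Cholesky factor of R. The sketch below takes that route; it is an assumed computational approach, not necessarily the authors' procedure, and the correlation matrix and reliabilities are hypothetical.

```python
import numpy as np

def canonical_dimensions(R, reliabilities):
    """Solve R* v = gamma^2 R v. Returns the roots (largest first) and the
    matrix V of canonical vectors, scaled so that V'RV = I."""
    R = np.asarray(R, dtype=float)
    R_star = R.copy()
    np.fill_diagonal(R_star, reliabilities)
    L = np.linalg.cholesky(R)                  # R = L L'
    L_inv = np.linalg.inv(L)
    M = L_inv @ R_star @ L_inv.T               # symmetric, same roots as the original problem
    roots, U = np.linalg.eigh(M)               # ascending order
    order = np.argsort(roots)[::-1]            # reorder to descending
    roots, U = roots[order], U[:, order]
    V = L_inv.T @ U
    return roots, V

def reliability_first_m(roots, m):
    """Mean of the m largest canonical roots."""
    return float(np.mean(roots[:m]))

# Hypothetical three-subscale example (not WISC values).
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
rel = np.array([0.80, 0.70, 0.90])
R_star = R.copy()
np.fill_diagonal(R_star, rel)

roots, V = canonical_dimensions(R, rel)
print(np.round(roots, 3))
print(round(reliability_first_m(roots, 2), 3))
```

Retaining all K dimensions reproduces the overall Mahalonobis reliability, since the mean of the roots equals Trace(R*R⁻¹)/K.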


If the canonical composites are weighted according to their reliability
(more weight for the more reliable dimensions) then an alternative for-
mula, due to Bock (1966), is available:

$$\rho_{B_m}^{2} = \frac{\sum_{j=1}^{m} \gamma_j^{2}/(1-\gamma_j^{2})}{\sum_{j=1}^{m} 1/(1-\gamma_j^{2})} \qquad (6)$$

Bock's method of weighting yields profile dimensions which have
equal standard errors of measurement, thereby allowing easier visual
contrasts of profiles, and is recommended for this purpose (see Conger,
1974).
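As a worked check, the two overall reliabilities can be computed for the first five canonical roots reported for the 7½ year olds in Table 1 (.94, .82, .65, .60, .55). The sketch assumes that the equal-weight reliability is the mean of the roots and that the reliability-weighted (Bock) form is Σγ²/(1−γ²) divided by Σ1/(1−γ²); the function names are illustrative.

```python
# First five canonical roots for the 7½ year olds, as read from Table 1.
roots_7_half = [0.94, 0.82, 0.65, 0.60, 0.55]

def equal_weight_reliability(roots):
    """Dimensions equated on total variance: mean of the roots."""
    return sum(roots) / len(roots)

def error_equated_reliability(roots):
    """Dimensions equated on error variance (assumed Bock-type weighting)."""
    numerator = sum(g / (1.0 - g) for g in roots)
    denominator = sum(1.0 / (1.0 - g) for g in roots)
    return numerator / denominator

print(round(equal_weight_reliability(roots_7_half), 2))
print(round(error_equated_reliability(roots_7_half), 2))
```

The two values round to .71 and .83, which agree with the figures reported in the results for the 7½ year olds under the two weightings.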


Interpretation and Rotation of Canonical Dimensions 


While it is very tempting to assign psychological meaning to the 
canonical dimensions directly from the canonical vectors obtained 
from equation (3), it should be noted that these vectors contain the
weights by which canonical scores are computed from the original 
variables and do not necessarily give the degree and pattern of associa- 
tion between the canonical dimensions and the original variables. That 
is, the vectors resemble a factor pattern and not a factor structure 
(Harman, 1967). The more appropriate matrix for interpretation is the 
matrix of intercorrelations between the original variables and the 
canonical variables. 

Intercorrelations between the canonical and original variables are 
found quite simply from the equation 


$$R_{xy} = R_{xx}V \qquad (7)$$

if the Y's are standardized canonical dimensions (Conger's weights);
however, $R_{xy}$ is invariant for the various weightings discussed above.

As in factor analysis, the matrix of canonical “loadings” given in
equation 7 might be simplified by rotation to a simple or "simpler"
structure; however, the dimensions as they are extracted (equation 3)
already possess two desirable characteristics: (a) they are sequentially
maximally reliable, and (b) they yield uncorrelated total, true and er-
ror scores (within each set). Any rotation, beyond a simple change in
scale using a diagonal transformation matrix, will destroy both of
these properties to some degree. Whether the loss of these properties
can be offset by a gain in interpretability can only be decided for each
analysis on an individual basis.

Transformations of the maximally reliable dimensions (Y's) can be
accomplished by an orthonormal rotation matrix Q (Q'Q = I),

$$Z = YQ \qquad (8)$$

where the Z's are rotated canonical scores. The rotated variables have
a structure given by

$$R_{xz} = R_{xy}Q$$

and a pattern

$$W = VQ.$$

In this form the Z's are uncorrelated,

$$R_{zz} = Q'V'R_{xx}VQ = I$$

and the rotated true scores have a variance-covariance matrix given by:

$$R_{T_zT_z} = Q'V'R_{xx}^{*}VQ = Q'\Lambda Q$$

where $\Lambda$ is a diagonal matrix of unrotated canonical reliabilities. It
should be noted that the trace of $R_{T_zT_z}$ is merely the sum of the
unrotated canonical reliabilities; however, $R_{T_zT_z}$ is not diagonal
(unless Q is). The multivariate reliability in the rotated space does,
however, remain invariant, as can be shown by substituting $R_{T_zT_z}$ and
$R_{zz}$ into equation (2).
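The rotation properties described here can be checked numerically. The sketch below uses a hypothetical three-subscale correlation matrix and an arbitrary planar rotation Q, both assumptions for illustration. It forms the structure and pattern matrices and verifies that the rotated scores stay uncorrelated while the true-score matrix keeps its trace but is no longer diagonal.

```python
import numpy as np

def canonical_dimensions(R, reliabilities):
    """Solve R* v = gamma^2 R v via a Cholesky reduction; V'RV = I."""
    R = np.asarray(R, dtype=float)
    R_star = R.copy()
    np.fill_diagonal(R_star, reliabilities)
    L_inv = np.linalg.inv(np.linalg.cholesky(R))
    roots, U = np.linalg.eigh(L_inv @ R_star @ L_inv.T)
    order = np.argsort(roots)[::-1]
    return roots[order], L_inv.T @ U[:, order]

# Hypothetical three-subscale example (not WISC values).
R = np.array([[1.0, 0.6, 0.5],
              [0.6, 1.0, 0.4],
              [0.5, 0.4, 1.0]])
rel = np.array([0.80, 0.70, 0.90])
R_star = R.copy()
np.fill_diagonal(R_star, rel)

roots, V = canonical_dimensions(R, rel)
R_xy = R @ V                                   # unrotated structure

# Arbitrary orthonormal rotation mixing the first two dimensions.
c, s = np.cos(0.5), np.sin(0.5)
Q = np.array([[c, -s, 0.0],
              [s,  c, 0.0],
              [0.0, 0.0, 1.0]])

R_xz = R_xy @ Q                                # rotated structure
W = V @ Q                                      # rotated pattern (weights)
R_zz = W.T @ R @ W                             # correlations among rotated scores
R_tz = W.T @ R_star @ W                        # rotated true-score matrix, Q' Lambda Q

print(np.round(R_zz, 6))
print(np.round(R_tz, 3))
```

The diagonal of the rotated true-score matrix gives the reliabilities of the rotated dimensions; their sum is unchanged, but the individual values are pulled toward one another.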


Data 


Data have been taken from the WISC manual (Wechsler, 1949). 
Wechsler provides correlation matrices for 200 subjects at each of 
three age levels: 7½, 10½, and 13½ year olds. Wechsler also provides
split-half reliability estimates for eleven of the twelve subscales for
each age group separately (Coding, since it is a speeded test, cannot be
used). Only the eleven subscales for which reliability estimates are 
available are used in the following analyses. 


Analyses 

Univariate reliabilities and intercorrelation matrices were substi-
tuted into equations (1), (2) and (3) to obtain overall estimates of pro-
file reliability. The vectors obtained from equation (3) were then sub-
stituted into (7) to obtain the unrotated canonical structure. This
structure, for each age group separately, was then subjected to a row
normalized varimax rotation on the retained dimensions.


Results: Canonical Reliability 


If profile differences are calculated by the Cronbach method, the
overall reliabilities are .68, .76, and .75 for the 7½, 10½ and 13½ age
groups respectively (equation 1). These overall reliabili
compare well with the total test reliabilities of .92, .95, 
(Wechsler, 1949) but are quite reasonable in magnitude com 
the reported verbal-performance difference reliabilities of .68 
.84 as found from equation 5 for total Verbal versus total 
mance. 

The Mahalonobis difference reliabilities (equation 2) are . 
and .53 for the respective age groups. These values are subs 
lower than the Cronbach difference reliabilities and indicate t 
possible comparisons should not be made from the WISC 
but a dimensional analysis might provide useful informati 
which kinds of profile comparisons might be made. 


Results: Canonical Dimensions


The roots obtained from equation 3 indicate that there are,
five reasonably reliable dimensions, and more definitely,
reliable dimensions within each age group. In all three groups there
was a break in the eigenvalues after the first two roots and a some-
what smaller break after the first five (Table 1). While the reli-
abilities of (unrotated) dimensions 3, 4, and 5 might fall below a
satisfactory magnitude for some test users (e.g., the reliability of
dimension 5 for 7½ year olds is only .55) all five dimensions have been
retained for interpretation and rotation. The overall reliabilities of the
first five dimensions are .71, .75, and .77 if equated on total variance and
.83, .90 and .88 if equated on error variance.


Interpretations: Unrotated Dimensions 


The first unrotated dimension (Table 2) in all three age groups has
high positive correlations with all WISC subscales. This pattern of


TABLE 1
Canonical Vector Reliabilities for WISC

Root    7½ Year Olds    10½ Year Olds    13½ Year Olds

  1         .94              .97
  2         .82              .84
  3         .65              ?
  4         .60              .66
  5         .55              .59
  6         .44              .54
  7         .43              .46
  8         .3?              .41
  9         .32              .27
 10         .19              .13
 11         .09              .08

? = not legible in the source; the 13½ Year Olds column is not recoverable.


CONGER AND CONGER


TABLE 2
Correlations of Unrotated Dimensions with WISC Subscales
[table body not recoverable]


EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


loadings indicates that the most reliable comparison among in-
dividuals is a simple comparison of profile levels (i.e., total test scores).
Although the weights for the various subscales (Table 3) are not equal,
the reliability of the first canonical composite is not sufficiently differ-
ent from total test reliability to warrant differential weighting of the
subscales.

The second most reliable dimension (Table 2) closely resembles a
verbal-performance contrast with some differences in pattern (T:
1959). In all three groups, the reliability of this dimension exceeds that
of the standard verbal-performance difference and should probably be
used in its place.




The results of the row normalized varimax are given in Tables4a 
5. The resulting structures of the various age groups were сот 


вгиелсе (Harman, 1967, р. 270). A good across age group fit was! 
tained for four of the five rotated dimensions; furthermore, these fc 
(1-4, Table 4) showed similar patterns of i 
both across and within age groups when cast inte 


i 


TABLE 3
Canonical Weights for First Two Maximally Reliable WISC Dimensions


Scale 


Information 


First Dimension Second Dimension 
104 13%4 7% 104 В 


 mprehension ү = m HA 11 
Arithmetic 15 17 20 
Similarities al a2 т i 3 
Vocabulary 20 32 32 37 32 
Digit Span 09 04 ‘04 15 01 
Picture Completion .08 ‘05 06 07 ғ 1 1 Я 
Рісішге Arrangement 16 07 (08 (05 -(4 b 
Block Design 30 18 as -ss  —% 
Object Assembly 11 .05 7 -—46 0 ШЕ 


-19 10 07 -.47 -43 
multitrait-multimethod (age groups) matrix (not shown). The inter-
correlations of the rotated true scores are very near zero with average
absolute values of .06, .04 and .04 and absolute maximum intercorrela-
tions of .14, .11 and .10 for the respective 7½, 10½ and 13½ year age
groups. 
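The across-age comparisons that follow are reported as φ coefficients of factor congruence. A minimal sketch of such a coefficient is given here, assuming the usual form: the sum of cross-products of two loading vectors divided by the square root of the product of their sums of squares. The loading vectors are hypothetical and are not taken from Tables 4 or 5.

```python
import math

def congruence(a, b):
    """Coefficient of congruence between two vectors of loadings."""
    cross = sum(x * y for x, y in zip(a, b))
    return cross / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Hypothetical loadings of five subscales on "the same" dimension in two age groups.
loadings_group_1 = [0.80, 0.70, 0.20, 0.10, 0.05]
loadings_group_2 = [0.75, 0.60, 0.30, 0.15, 0.10]

phi = congruence(loadings_group_1, loadings_group_2)
print(round(phi, 3))
```

The coefficient equals 1 for proportional loading patterns and 0 for patterns with no overlap, so values in the .90s indicate closely matching dimensions.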

The first rotated dimension (Table 4) corresponds closely to a factor 
of Verbal Comprehension (Robb, Berndardoni, and Johnson, 1972) 
and has across age groups φ's of .96, .91, and .94. The differences
across age groups primarily involve differences in the Comprehension 
subscale (agreeing with factor analytic results) and Vocabulary (not 
agreeing with factor analytic results). The increasing loss of impor- 
tance of Vocabulary for increases in age is, however, offset by the sec- 
ond rotated dimension. 

The second dimension has loadings somewhat similar to the first, es-
pecially for the younger age groups. The φ values of factor congruence
are .90, .76 and .94. The loadings correspond somewhat to the “R”
factor of Relevance in the 7½ and 10½ year olds (Robb et al.), which
has high loadings for Comprehension and Vocabulary. Vocabulary
tends to dominate the factor, but the other verbal scales also have high
loadings. The R and Verbal Comprehension factors seem to be
somewhat confounded in the 13½ year age group.

The third dimension has good agreements across age groups (φ's of
.92, .90, and .94) and, especially for 13½ year olds, resembles a Percep-
tual Organization factor (Robb et al.). In all three groups, this factor
is dominated by Block Design (loadings of .92, .90 and .90). Devia-
tions from the Perceptual Organization factor are primarily due to the
absence of Mazes (see dimension 4) in all groups and low loadings for
Picture Arrangement in 7½ year olds and Picture Completion in 10½
year olds (see dimension 5 for both groups).

The fourth rotated dimension is a very clear specific factor, having a
high loading for Mazes (.94, .91, and .89) with good agreement across
age groups (φ's of .89, .87 and .93). Mazes contributes very indepen-
dent information to the rotated profile dimensions and thus could be
added or subtracted to other dimensions with no loss in information. 
For example, a better Perceptual Organization factor score could be 
obtained by adding scores on dimension 4 to scores on dimension 3. 

The fifth rotated dimension shows some correspondence between
7½ and 13½ year olds (φ = .76) but little agreement for other pairings
of the age groups (-.13 for 7½ versus 10½ and .26 for 10½ versus
13½). In the 7½ year olds, this dimension resembles the Freedom from
Distractibility factor but has little resemblance to that factor or any
other major factor of 10½ and 13½ year olds. This ‘residual’ dimen-
sion has the lowest reliability of the rotated dimensions for each age
group, but its deletion would result in only moderate gains in mul-
tivariate reliability (.01, .04, and .03 for the respective age groups).


Comparisons of Rotated and Unrotated Dimensions


The rotated dimensions (Table 4) do possess a reasonable
structure and across age groups match with little loss in the in-
dependence of the underlying true scores. There is, however, a leveling
out of the reliabilities of the rotated dimensions with respective max-
imum values of .76, .82 and .87.

The unrotated solutions have only two dimensions and
general factors which correspond closely to the major uses of
WISC scores, i.e., overall IQ and Verbal-Performance differences. The
reliabilities of each of these two dimensions exceed the maximum of the
rotated dimensions (with the exception of the Verbal-Performance
contrast versus the Perceptual Organization factor in 13½ year olds,
i.e., .86 versus .87).

There seem to be advantages to both types of solutions, and
a choice between them should be predicated on the uses to which the
WISC profile will be subjected. Clinical applications would
favor retaining only the two major unrotated dimensions, whereas
research purposes might be better served by the rotated dimensions. If
one further considers that the Mazes subscale is frequently omitted
during WISC testings, the advantages of the unrotated solution are
enhanced. It is possible, of course, to keep the rotated dimensions and
to combine them as needed to form the level and Verbal-Performance
contrasts.


A Note on Unreliable Dimensions


In the same manner that the most reliable dimensions indicate which
reliable comparisons can be made among individuals, the least reliable
dimensions indicate the comparisons that should not be made. In par-
ticular, the tenth dimension for 7½ and 13½ year olds and the eleventh
dimension for 10½ year olds (Table 6) indicates that a composite con-
trasting Information with
reliable (ρ = .
7½ and 13½ year olds is also similar.
dicating a
(especially Information versus Vocabulary) and among performance
scales. To the extent that the verbal scales have high common
variance, differences among them would be very unreliable (the same
would be true for performance scales). The tenth dimension in
TABLE 6
Canonical Weights for Least Reliable WISC Dimensions

                       7½ Year Olds     10½ Year Olds     13½ Year Olds
Scale                   V10     V11      V10     V11       V10     V11

Information            1.05     .40      .28    1.40      1.42     .63
Comprehension          -.45     .68      .07     ?        -.89     .46
Arithmetic             -.76     .25     -.20     ?         ?      -.28
Similarities           -.30     ?        ?       ?         .20    -.07
Vocabulary              ?      -.54     -.45     .04      -.51    -.70
Digit Span             -.01     .63      .96     .27      -.02     .56
Picture Completion      .46    -.30      .10    -.07       .11    -.62
Picture Arrangement     .23     ?       -.24     .16       .27    -.04
Block Design           -.06     .31      .40     ?         ?      -.61
Object Assembly        -.05     .31      ?       .58      -.26    1.18
Mazes                  -.03     ?       -.25     ?        1.23    -.24
Reliability             .19     .09      .13     .08       .29     .06

? = not legible in the source.


Object Assembly versus Digit Span and Picture Completion. Of par- 
ticular interest is that because these three subscales have very low cor- 
relations with the remaining scales, one would expect that large 
differences in profile patterns are a priori more probable, but
simultaneously, less reliable. 


Summary 


The preceding results lead to the following recommendations con- 
cerning the use of the WISC subscales for profile comparisons: 
1. As far as "clinical" use of WISC profile comparisons is con- 
cerned, our advice is “caveat emptor." Pure intuition as to whether 
any two profiles differ or not will quite frequently capitalize on an un- 
reliable difference. If profiles are compared by the Cronbach D? 
method, the reliability of the differences will be good (around .73) but 
not outstanding. If all possible differences are allowed, the reliability 
of an "average" difference is quite unacceptable (around .51). 
2. While five uncorrelated dimensions are probably sufficient to es- 
tablish a set of reasonably reliable WISC profile dimensions as 
predicted by Cronbach (Cronbach et al., 1972), only two dimensions
have an unambiguous and readily interpretable pattern. The clinician 
should probably restrict himself to diagnoses using these two largest 
dimensions (total IQ and Verbal Performance differences) until 
satisfactory reliability and validity are established for the other uncor-
related dimensions.

It is also possible to establish four dimensions which are similar 
across age groups and which correspond to factor analytic dimensions. 
Three of these dimensions correspond to group factors of Verbal 
Comprehension, Relevance and Perceptual Organization while the 
fourth is specific to Mazes. Thus, reliable individual differences can be
found for these three group factors and the Mazes subscale can be con- 
sidered as providing reliable individual differences beyond that. These 
four dimensions might be quite useful for research purposes involving 
the WISC subscales when it is not considered feasible to utilize all 11 
subscales (e.g., for Multiple Regression or Analysis of Variance
analyses). The advantage of these new dimensions over the subscales 
themselves and over factor scores is that these dimensions are all 
reasonably reliable, reasonably interpretable and are virtually in- 
dependent. Such a claim could not be made for the numerous factor 
scores discussed by Robb et al. (1972) or for the separate subscales 
themselves. 

3. The most reliable dimension for all three age groups involves ap-
proximately a simple sum of the subscale scores. While Vocabulary
and Block Design do have relatively high weights, the gain in 
reliability of the more complex sum is insignificant by comparison.
The second most reliable dimension is a Verbal-Performance contrast 
and in this case, two features should be noted. The weighted Verbal- 
Performance contrast does provide a gain in reliability over a simpler 
sum, and the weights differ for the different age groups. 

4. The least reliable dimensions indicate that comparisons should 
not be made among subscales that are highly correlated relative to
their level of reliability. In particular, Information versus Comprehen-
sion plus Arithmetic is a source of very unreliable individual
differences.

Based on the above results, a general statement seems warranted. It
is no great surprise that the WISC functions best for the purpose for
which it was designed, i.e., large amounts of time and money were ex-
pended to develop an instrument that would “measure” general ability
and verbal and performance abilities. The WISC serves these purposes
rather well. What is surprising is that psychologists take such an in-
strument and combine and permute the subscales in n-different ways
with little concern for such test fundamentals as reliability and
validity. There seems to be a type of logic involved that asserts: “if the
instrument is good in serving one purpose, it, therefore, must be good
in serving any purpose.” As Peterson (1968) asserts, albeit in a
somewhat different context, “there is no cheap way to study human
behavior.”
PAIRED COMPARISONS INTRANSITIVITY: 
TRENDS ACROSS DOMAINS OF CONTENT AND 
ACROSS GROUPS OF SUBJECTS 


DARWIN D. HENDEL 


Measurement Services Center 
University of Minnesota 


The present study investigated the extent to which intransitivity, as 
measured by the total circular triad (TCT) score in the method of 
paired comparisons, varies across different domains of content and 
across different groups of subjects. Three paired comparisons instru-
ments were administered to 276 high school students and to 358 col-
lege students. Results indicated statistically significant (p ≤ .001)
correlations among the three TCT indices for both groups of sub- 
jects. 


THE method of paired comparisons, one application of Thurstone’s 
(1927) “law of comparative judgment,” can be used to obtain pref- 
erences for a set of stimulus objects. In addition, the method yields an 
index of response intransitivity which has been termed the “total cir- 
cular triad” score. 

One of the aspects of the total circular triad score which has been 
questioned concerns the generality of response intransitivity as 
measured by the total circular triad score. Does intransitivity occur 
across instruments or is intransitivity instrument specific? Do
relationships among intransitivity indices for one group of subjects
replicate for other groups of subjects?

Although previous investigations (Gulliksen, 1964; Ace and Dawis, 
1972) have obtained intransitivity indices for more than one instru-
ment, the generality of intransitivity across subject groups has not
been established. The purpose of the present study was to investigate
the generality of intransitivity across widely different domains of con-
tent and across different groups of subjects.


Copyright © 1975 by Frederic Kuder 
Method 


Two groups of subjects were used in the present investigation. The 
first group consisted of 276 students from Cooper Senior High School 
in the Robbinsdale (Minnesota) school district; the students par- 
ticipated in the questionnaire session as part of the class work in 
vocational education. The mean age was 17.1 years (SD = .67); 96 
(35%) of the students were males and 180 (65%), females. The second
group consisted of 358 students enrolled in Introductory Psychology 
at the University of Minnesota; students received experimental points 
for participating in the questionnaire session. The mean age was 19.9 
years (SD = 2.48); 168 (47%) of the students were males and 190 
(53%), females. 

Three paired comparisons instruments were used to obtain the total 
circular triad indices of response intransitivity. Each instrument con- 
tained 20 statements in a complete paired comparisons format which 
resulted in a total of 190 paired comparisons items [n(n - 1)/2 items
where n equals the number of statements]. The first instrument used 
was the Minnesota Importance Questionnaire (MIQ; Gay, Weiss, 
Hendel, Dawis and Lofquist, 1971) which contains statements of 
vocational needs (e.g., “I could do something that makes use of my 
abilities"). The second instrument, designed by the author, was the 
Mate Selection Questionnaire (MSQ), which contains qualities 
presumed to be used frequently in choosing a mate (e.g., “physical at- 
tractiveness"). The third instrument, designed by the author, was the 
Food Preference Questionnaire (FPQ) which obtained preferences for 
main course meals (e.g., “hot beef sandwiches"). For each instrument, 
the subject indicated his/her preference between pairs of responses. 

Total circular triad scores were calculated according to Kendall's 
(1955, p. 125) formula: 


$$TCT = \frac{1}{2}\left[\frac{1}{6}\,n(n-1)(2n-1) - \sum_{i=1}^{n} X_i^{2}\right]$$


where the $X_i$ represent an individual's scale scores on the n stimulus
objects. The 20 scale scores for each of the three inventories reflected 
the number of times each of the stimulus objects was preferred in the 
total set of 190 items. Scale scores could range from 0 (statement was
never chosen in any pair) to 19 (statement was chosen every time it ap-
peared in a pair).
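The scoring can be illustrated with a short sketch. It assumes the formula has the form TCT = [n(n − 1)(2n − 1)/6 − ΣXᵢ²]/2 and checks it against a direct count of circular triads in a complete set of paired choices. The function names and the simulated respondent are hypothetical.

```python
from itertools import combinations
import random

def tct_from_scores(scores):
    """Total circular triads from scale scores (times each stimulus was chosen)."""
    n = len(scores)
    total = n * (n - 1) * (2 * n - 1) / 6
    return (total - sum(s * s for s in scores)) / 2

def count_circular_triads(prefers):
    """Direct count; prefers[i][j] is True when stimulus i was chosen over j."""
    count = 0
    for a, b, c in combinations(range(len(prefers)), 3):
        if prefers[a][b] and prefers[b][c] and prefers[c][a]:
            count += 1
        elif prefers[a][c] and prefers[c][b] and prefers[b][a]:
            count += 1
    return count

# Hypothetical respondent: random choices over 7 stimuli (21 pairs).
random.seed(1)
n = 7
prefers = [[False] * n for _ in range(n)]
for i, j in combinations(range(n), 2):
    if random.random() < 0.5:
        prefers[i][j] = True
    else:
        prefers[j][i] = True

scores = [sum(row) for row in prefers]
print(tct_from_scores(scores), count_circular_triads(prefers))
```

A fully transitive respondent (scale scores 0, 1, ..., n − 1) obtains TCT = 0. With 20 stimuli the maximum is 330, reached when ten stimuli are each chosen 10 times and ten are each chosen 9 times.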

Pearson product-moment correlations (Guilford, 1956, p. 95) were
calculated among the TCT scores on the MIQ, MSQ and FPQ for the
high school and college student groups separately.
TABLE 1
Intercorrelations of Total Circular Triad Scores on the MIQ, MSQ, and FPQ for High School and
College Student Groups

Group                         Intercorrelations*                     Mean     S.D.

High School Group
(N = 234)          TCT on MIQ   TCT on MSQ   TCT on FPQ
  TCT on MIQ                                                         91.27    71.20
  TCT on MSQ         .45***                                          61.16    49.26
  TCT on FPQ         .3?***       .39***                             34.18    35.97

College Student Group
(N = 348)          TCT on MIQ   TCT on MSQ   TCT on FPQ
  TCT on MIQ                                                         52.39    40.42
  TCT on MSQ         .39***                                          35.34    27.65
  TCT on FPQ         .25***       .23***                             24.06    23.42

*Minimum Pearson product-moment correlation coefficients necessary for significance at the .05, .01, and .001 levels of significance
are [illegible], .166, and .210 respectively for the high school group (232 df) and .105, .138, and .174 respectively for the college student group.
* p ≤ .05.
** p ≤ .01.
*** p ≤ .001.
? = digit not legible in the source.


Results and Discussion 


Table 1 contains the Pearson product-moment correlations among 
the three TCT scores for both subject groups. The correlations were 
significant at p < .001 for both the high school and college student 
groups. Although the level of the correlations was consistently higher 
for the high school group, the pattern in the correlation matrix was
similar for both groups. 

Considering the TCT scores as indices of response intransitivity, the 
obtained correlations suggest that the tendency to respond intran- 
sitively does generalize across different domains of instrument content 
and across different groups of subjects. 

The results suggest that intransitivity is not a random error variable.
Intransitivity can be considered as an indicator of true differences
among individuals rather than as an instrument specific index of error
in responding to paired comparisons instruments. However, intran-
sitivity is not totally generalizable from one content domain to
another. Although an individual who is intransitive on one instrument
is more likely to be intransitive on other instruments, specific sets of
stimuli probably interact with the individual's basic tendency to re-
spond intransitively.
AN ANALYSIS OF THE MEANING OF THE 
QUESTION MARK RESPONSE CATEGORY 
IN ATTITUDE SCALES¹


BERNARD DuBOIS² AND JOHN A. BURNS 
Northwestern University 


Although most scaling formats include an intermediate or neutral 
response category, little research has been devoted to the analysis of
the meaning respondents attach to this category. Results obtained
from ten different scales, across two types of item formats (Likert
and Polar Choice) support the traditional method of scoring the “?”
answer. Although the meaning respondents imply when selecting the
“?” is not more ambiguous than the meaning implied in the selection
of the other response categories, there does exist evidence for the 
presence of a variety of uses of the “?” including response styles, am- 
bivalence and indifference. Various suggestions are made for further 
research and alternate methods of approach to the meaning of the 
question mark response category. 


ALTHOUGH a multiple-indicator approach to the assessment of
social attitudes has been advocated by various authors (Cook and
Selltiz, 1970; Webb, Campbell, Schwartz, and Sechrest, 1966; Summers,
1970), most current research still relies on the time-honored
self-report technique of attitude scales.

An attitude scale represents an attempt to measure an individual's
dispositions toward a given issue by asking him to express his degree
of agreement or disagreement with a group of statements. Although a
variety of scales are found in the literature (e.g., Thurstone
Equal-Appearing Interval Scale, Guttman Scalogram), the type most
frequently employed is the summated rating scale (Likert, 1932), more
commonly known as the Likert scale. In a Likert scale, the format of
responses to an item traditionally consists of one or more levels of
agreement (e.g., “Strongly Agree,” “Agree”), one or more levels of
disagreement (e.g., “Strongly Disagree,” “Disagree”), and one level of
indecision or indetermination (the question mark (“?”), “undecided,”
or “neutral” response category).

In the psychometric literature a substantial amount of research has
focused on the properties of the entire scale (validity, reliability,
etc.), or on the properties of individual items (representativeness,
ambiguity, etc.). However, investigators generally do not examine the
meaning of each item's response categories, regarding these meanings
as relatively unambiguous: “Strongly Agree” means that the respondent
agrees more strongly than just agreeing; “Strongly Disagree” means
disagreeing more strongly than just disagreeing; the neutral point
means just that, in the middle. Subsequently, the researcher typically
assigns scores implying that the response categories are on an interval
scale. For example, he would use the following scoring system:
“Strongly Agree” = 5; “Agree” = 4; “?” = 3; “Disagree” = 2; “Strongly
Disagree” = 1. It is always assumed that the respondent uses the same
set of meanings as the researcher.

Although Likert (1932) had originally presented data which supported
this scoring system, more recent research (cf. Stanley and Wang, 1968;
Wang and Stanley, 1970) indicates that these assumptions may not be
valid. For example, due to the item's wording, respondents may actually
be interpreting “Strongly Agree” as being in the region of a score of
six or seven rather than five; they see it as being “[illegible]
Strongly Agree,” i.e., more than just one equal interval from “Agree.”
Various post hoc statistical methods have been developed which generate
item response scores which are more sensitive to the meaning
respondents attach to their responses (e.g., MacDonald,

This increased awareness on the part of researchers of the meanings
of individual responses is a welcome initiative. Researchers need to
consider more than just the score obtained on each item; they need to
understand how the subject interprets each item and its response
alternatives before the difficulties involved in measuring attitudes
with precoded instruments can be overcome. In general, too many aspects
of the individual items of attitude scales are left untested.

The problem of the meaning and appropriate scoring of the “?”
response category is certainly such an aspect. According to the small
amount of research conducted on this topic (Cronbach, 1946;
Edwards, 1946; Worthy, 1969; Goldberg, 1971; and Kaplan, 1972), the
essential meaning of the neutral category or question mark is either
one of ambivalence or indifference.

The ambivalent respondent has “mixed feelings,” i.e., positive and
negative sentiments concentrated on the same object (Brown, 1965),
and cannot make up his mind as to whether he agrees or disagrees
with the proposed statement. To express his internal state of
indetermination, he marks the “?” answer. Ambivalence is often the
result of over-involvement with the issue under analysis. Well
familiarized with the pro and con arguments, the ambivalent respondent
finds it difficult to make a final choice.

The indifferent respondent checks the “?” response because he has
minimal concern for the topic involved in the statement. While
ambivalence often results from overinvolvement, indifference generally
indicates underinvolvement. Operationally, it therefore may be possible
to separate ambivalent from indifferent respondents if one has data
available concerning the respondents' level of involvement. It is
hypothesized that the ambivalent respondent should exhibit a high
level of attitude intensity while a low level of intensity should
characterize the indifferent respondent (cf. Diab, 1965).

Ambivalence and indifference, however, are by no means the only
two factors which account for the use of the “?” response category.
Some respondents, for example, may check the “?” mainly because,
although not indifferent, they do not feel competent enough or
sufficiently informed to take a position. Others might use it as an
indirect way of expressing their refusal to reveal their personal
feelings. Still others might use it because they do not understand the
attitudinal statement.

Various positions have been taken on the scoring of the question
mark. At one extreme, some researchers, for example Cronbach
(1946), have suggested that, given the ambiguous meaning of this
category, attitude researchers should probably discontinue its use.
Other researchers have adopted the reverse approach by making the
meaning of the “?” more specific. Goldberg (1971), for example,
suggests the following format to supplement the “?” answer:

Note: Where you have indicated your feelings about one of the 
above statements by circling a question mark, please go back and 
write in one of the following next to the question mark: 


I   Indifferent
M   Mixed feelings
DK  Don't know
O   Other; please specify your feelings


Most attitude researchers, however, seem to have adopted an in- 
termediate and rather intriguing position; although recognizing that 
respondents may use the question mark response category for a variety 
of reasons, they code it as if it were an indicator of a middle position 
within a favorableness-unfavorableness continuum, i.e., as if it 
represented the arithmetic average of the two positions on either side 
of it. This position is exemplified by Osgood, Suci, and Tannenbaum 
(1957), in their instructions for the semantic differential scale: 


The direction toward which you check, of course, depends upon 
which of the two ends of the scale seems most characteristic of the 
thing you’re judging. 


If you consider the concept to be neutral on the scale, both sides of 
the scale equally associated with the concept, or if the scale is com- 
pletely irrelevant, unrelated to the concept, then you should place 
your checkmark in the middle space: 


[Illustrative scale line; only the right-hand pole, “dangerous,” is legible.]


Osgood, Suci, and Tannenbaum then suggest the following scoring
system:


Clearly, this is equivalent to considering the “neutral” point as an
indicator of a middle position along a continuum. Most elementary
textbooks on attitude measurement (e.g., Oppenheim, 1966, p. 133) also
suggest this approach.

Therefore, it appears that the most popular current practice followed
by attitude researchers with respect to the interpretation of the
question mark response category rests on the implicit assumption
that either (1) respondents use the question mark response category as
an indicator of an intermediate position to a far greater extent than
they use it for other reasons, so that this particular use is the only
one which really deserves consideration; or (2) respondents use the
question mark response category for many various reasons, but these
reasons tend to counterbalance each other so that the final effect is
similar to the one obtained had respondents used the question mark
response category for the sole purpose of expressing a neutral position.

This paper addresses itself to the problem of testing this double as- 
sumption. 


If the question mark response category is used by respondents
almost uniquely as an expression of an average position on the
agree-disagree continuum, two specific results should be expected.

First, when we plot the respondents' total (all-item) scores against
the scores obtained on any item of a scale, the dispersion of the total
scores obtained for those respondents who checked the “?” answer on
the item under consideration should be similar to the dispersion of the
total scores obtained for the respondents checking any other response
category. This implies that the respondent's use of the “?” answer is
not more equivocal than the use of any other response category and
therefore the “?” answer represents a given position on the construct
being measured in the same manner as do the other response
categories (given the assumption that the scale under consideration is
internally consistent).

Second, the mean of the distribution of the total scores obtained for
those respondents who used the “?” answer should fit the pattern of
means suggested by the distributions of total scores corresponding to
the other response categories. The reason, of course, is that the “?”
is assumed to indicate an average position on the agree-disagree
continuum.

Figure 1 illustrates the above situation: the mean of the distribution
of total scores for the question mark category (?) fits the pattern of
means (here a straight line) and its dispersion of total scores is
similar to the other categories' dispersions of total scores.

[Figure 1: plot of total scores (vertical axis) against response
category (horizontal axis: A = Strongly Agree, a = Agree, ?,
d = Disagree, D = Strongly Disagree).]

Figure 1. Hypothetical total scores for each response category
indicating no difference in variance of total scores.


When the question mark category is used for other purposes than to
indicate an intermediate position, this particular pattern should not be
expected. In fact, if we assume that the motivations underlying the use
of the question mark category vary among respondents, the observed
standard deviation for the question mark category distribution should
be significantly higher than the standard deviations obtained for the
other response categories' distributions, although the means can still
remain unchanged. If, furthermore, those motivations do not
“counterbalance” each other, then the mean of the “?” distribution of
total scores will not be in line with the other means.

This is shown in Figure 2. First, the dispersion of the total scores
obtained for those respondents using a “?” is greater than the
dispersion of total scores obtained for those respondents using any
other response category, indicating that the reasons leading
respondents to mark the “?” answer are more numerous and more
heterogeneous than the reasons which led them to mark any other
response category; second, the mean of the “?” distribution does not
follow the pattern of means suggested by the other response categories,
indicating that the final effect, when all the reasons for using the
“?” answer are averaged, cannot be considered as equivalent to the one
obtained had respondents used the “?” response for the unique purpose
of indicating an average position along the agree-disagree continuum.


[Figure 2: plot of total scores (vertical axis) against response
category (horizontal axis: A = Strongly Agree, a = Agree, ?,
d = Disagree, D = Strongly Disagree).]

Figure 2. Hypothetical total scores for each response category
indicating differences in variance of total scores.


Method


In order to test which of the above situations was most representative
of the way in which respondents interpreted items, data were used which
were collected from 300 respondents in connection with a project of a
graduate class in Social Attitude Measurement. The class was given the
assignment of constructing a series of items both of the Likert type
(“Strongly Agree,” “Agree,” “?,” “Disagree,” “Strongly Disagree”) and
polar type³ (“Strongly Agree with A,” “Agree with A,” “?,” “Agree with
B” (B being the polar opposite of A), “Strongly Agree with B”). The
items were designed to represent unidimensional issues in one of the
three following areas: childless vs. childful marriage; the generation
gap; and the drugs issue. Within each area or team, approximately ten
scales were constructed, each one on a different dimension of the same
issue. For each dimension within each issue, there was one scale
comprising eight polar choice statements, the other scale representing
a random ordering of the two derivatives of each polar item, so that
each Likert scale contained sixteen items.⁴ In total, each team had
about 100 respondents, each filling out all questionnaires developed
within the team. A criterion-related validity measure and a reliability
test were applied to all scales in order to identify the best of them.
The 10 best scales were selected, each having a Kuder-Richardson
formula reliability coefficient of greater than .60 on both Polar and
Likert scales, as well as a convergent validity index greater than .20.

³ Polar choice items are constructed such that two alternative poles of
the same dimension are incorporated into one question. Each pole is
worded in such a way as to allow any respondent to agree with one of the
two poles. The alternatives are chosen such that they are “polarily
opposite” in meaning, i.e., agreeing with one alternative logically
prohibits a person from agreeing with the other. The format and
instructions typically employed are as follows:

Each item consists of two alternatives, A and B, between which you are
asked to choose by circling one of the appropriate indicators:

A  Statement A is entirely preferred to Statement B as an expression of my opinion.
a  Statement A is somewhat preferred to Statement B.
?  I cannot choose between A and B.
b  Statement B is somewhat preferred to Statement A.
B  Statement B is entirely preferred to Statement A as an expression of my opinion.

Example:

A a ? b B    a. I feel most people are generally happy.
             b. I feel most people are generally sad.

Then, for each such scale, total scores were recomputed according
to the following scoring system:


Format   Response Category        Coded As   Scored As

Likert   Strongly Agree           A          4
         Agree                    a          3
         ?                        ?          (not scored)
         Disagree                 d          2
         Strongly Disagree        D          1
Polar    Strongly Agree with A    A          4
         Agree with A             a          3
         ?                        ?          (not scored)
         Agree with B             b          2
         Strongly Agree with B    B          1


⁴ In this analysis, the criterion for determining who is consistent and
who is not is based exclusively on the use of the “?” response category
on the polar item. That is, it is inconsistent to agree with Likert A
form and disagree with Likert B form, and at the same time have selected
the “?” on the polar item (this pattern should have followed from
agreeing with the A side of the polar choice question). Conversely, it
is consistent to disagree or agree with both Likert items and mark a
“?” for the corresponding polar statement.


Thus, the contributions of the question mark response category to 
the total scores were ignored. This was necessary since scoring 
responses the traditional way would have implied that the question 
mark response category is an indicator of an average position. 

As a result of this modification, the total scores of those respondents
who used at least one “?” answer were based on less than the total
number of items of the scale. Thus, to allow the comparisons of scores
based on a different number of contributing items, average rather than
total scores were used. They were computed according to the following
formula:

$$\text{Average total score} = \frac{\text{Total score computed on a 4-point scale}}{\text{Total number of items} - \text{Number of items for which the respondent used the “?” response category}}$$
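
The averaging rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' own procedure: the function name `average_total_score` and the use of single-letter codes as input are hypothetical, and the score values are taken from the paper's scoring table (the “?” is left out of both the sum and the item count).

```python
from typing import Optional, Sequence

# Item scores from the paper's scoring table. Polar codes b and B are
# scored like Likert d and D. "?" has no entry because it is not scored.
SCORES = {"A": 4, "a": 3, "d": 2, "D": 1, "b": 2, "B": 1}


def average_total_score(responses: Sequence[str]) -> Optional[float]:
    """Return the mean score over the items not answered with "?".

    Returns None when every item was answered "?", since the
    denominator (items minus "?" items) would then be zero.
    """
    scored = [SCORES[r] for r in responses if r != "?"]
    if not scored:
        return None
    return sum(scored) / len(scored)


# Hypothetical respondent: four items, one of them answered "?".
print(average_total_score(["A", "a", "?", "d"]))  # (4 + 3 + 2) / 3
```

Dividing by the number of scored items, rather than by the scale length, is what keeps respondents with different numbers of “?” answers comparable.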

Finally, before comparing the variances of average total scores
obtained for each response category of each item, it was necessary to
remove the contribution of the responses for the item under
consideration from the average total score. Otherwise, its inclusion
would have resulted in double counting and the data would have been
“contaminated.” Thus for each item, average total scores were
recomputed as follows:


Response to the item
under consideration
Polar format   Likert format   New average total score

A              A               (Total score − 4) / (No. of items − 1)
a              a               (Total score − 3) / (No. of items − 1)
?              ?               (Total score − 0) / (No. of items − 0),
                               i.e., the average total score is unchanged
b              d               (Total score − 2) / (No. of items − 1)
B              D               (Total score − 1) / (No. of items − 1)

After all these modifications, the variance of the new average total
scores for the “?” response category was compared to a weighted average
of the variances of the new average total scores obtained for the other
response categories, and the corresponding F ratio was tested for
statistical significance.
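
A minimal sketch of this comparison follows. The paper does not say how the average of the other categories' variances was weighted; weighting each variance by its degrees of freedom (a pooled variance) is an assumption made here, and the function name and the data are hypothetical. The sketch returns the F ratio and its degrees of freedom and leaves the significance lookup to an F table or a statistics package.

```python
from statistics import variance
from typing import Dict, List, Tuple


def question_mark_f_ratio(groups: Dict[str, List[float]]) -> Tuple[float, int, int]:
    """Compare the "?" category's variance with the other categories'.

    `groups` maps a response category to the item-removed average total
    scores of the respondents who chose it. The other categories'
    sample variances are pooled, each weighted by its degrees of
    freedom (an assumed weighting; the paper does not specify one).
    """
    qm = groups["?"]
    # A variance needs at least two scores, so smaller groups are skipped.
    others = [g for cat, g in groups.items() if cat != "?" and len(g) > 1]
    df_other = sum(len(g) - 1 for g in others)
    pooled = sum((len(g) - 1) * variance(g) for g in others) / df_other
    return variance(qm) / pooled, len(qm) - 1, df_other


# Hypothetical scores for one item.
example = {
    "?": [2.0, 3.0, 4.0],
    "A": [3.0, 3.5, 4.0],
    "D": [1.0, 1.5, 2.0],
}
f_ratio, df_num, df_den = question_mark_f_ratio(example)
print(f_ratio, df_num, df_den)
```

An F ratio well above 1 would point to more heterogeneous use of the “?” than of the other categories; a ratio near or below 1 would not.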


Results 


As can be seen in Table 1 the average total score variance of the “?” 
category is significantly higher (at the .05 level) than the average 
weighted variance computed for all other categories in only four cases 
out of 240, and significantly smaller in 11 cases. Therefore, it can be 
concluded that the variability of the average total scores of respon- 
dents selecting the “?” response category is not statistically different 
from the variability of the other categories. The average variability for 
each response category shown in Figure 3 confirms this result. 

Furthermore, Figure 3 shows that the means of average total scores 
obtained from each category across both types of formats are virtually 
on a straight line, and that the mean corresponding to the “?” response 
category holds an intermediate position with respect to the other 
means, indicating that those choosing the “?” category are those who 
are in the middle of the attitude continuum as measured by the average 
total scores. Therefore, the assumption underlying the scoring of the 
“?” response category made by attitude researchers when coding their 
items is supported by empirical evidence across ten different scales in 
both polar and Likert formats. 

To test the scope and degree of stability of this conclusion, it was 
decided to observe the behavior of the responses obtained in a context 
where the potential uses of the “?” response category are increased and 
its meaning further delineated. Such a context is provided by the 
simultaneous comparison of each polar choice question and its two 
Likert derivatives. Thereby the pattern of responses of individuals 
across essentially three similar questions can be followed. 


Analysis of Question Mark Response across 
Polar and Likert Formats 


Table 2 illustrates the distribution of responses on the two Likert
questions, given the selection of a “?” on the polar choice format. As
can be seen, respondents choosing the “?” response category on a
polar choice question have a variety of reasons for doing so.

The largest individual cell in the table is the one predicted from the
hypothesis that a person choosing the “?” for the polar choice question
will also choose a “?” in the two Likert derivatives of the same
item. Eighteen percent of the respondents fell in this category. This
response pattern represents those who may be called the “truly
undecided.” The label “truly undecided” may be a misnomer, however,
since this 18% may not actually represent an equal and infrequent use
of the “?” by a majority of respondents, but may be due to a few
persons consistently employing a “?” across different question
contents. Such a pattern of responding has received wide attention in
the psychometric literature (e.g., Cronbach, 1946; Bentler, Jackson,
and Messick, 1971) under the general name of response styles.

                      LIKERT FORMAT (n = 15136; all items, all scales)

                         SA        A        ?        D       SD
Mean                  2.705    2.530    2.419    2.343    2.203
Standard deviation     .540     .813     .538     .400     .529

                      POLAR FORMAT (n = 6992; all items, all scales)

                    SA with A  A with A      ?   A with B  SA with B
Mean                  2.596     2.491    2.350     2.297     2.135
Standard deviation     .653      .516     .647      .535      .650

Figure 3. Means of average total scores obtained for each response category.

It is unlikely that a person will consistently respond with a “?”
across scales dealing with relatively different topics. Accordingly, as
a rough test of the existence of evidence that some persons may be “?”
response stylers, a simple frequency count of the number of times
someone chooses a “?” was performed.

TABLE 2
Distribution of Responses across Both Likert Scales, Given the Selection
of a “?” on the Corresponding Polar Item

                                       LIKERT B
Count                   Strongly                                Strongly       Row
Tot. Pct.                  Agree     Agree         ?  Disagree  Disagree     Total
LIKERT A                       1         2         3         4         5
Strongly Agree    1            8        13        24        17        27        89
                              .6       1.0       1.9       1.3       2.1       7.0
Agree             2            8        47        71        74        27       227
                              .6       3.7       5.6       5.8       2.1      17.9
?                 3            8        38       225        78        29       378
                              .6       3.0      17.7       6.1       2.3      29.8
Disagree          4           13        89        83       157        36       378
                             1.0       7.0       6.5      12.4       2.8      29.8
Strongly Disagree 5           20        26        32        35        85       198
                             1.6       2.0       2.5       2.8       6.7      15.6
Column Total                  57       213       435       361       204      1270

From this frequency count, it was found that the 18% of the
responses found in this cell was not 18% of the sample, as would be
expected if the hypothesis of no “?” response styles were true. Instead,
9% of the respondents in our sample (or 22% of the number of people
comprising the initial 18%) accounted for 55% of the triple “?”
responses, thereby lending credibility to the hypothesis of the
potential presence of a “?” response style. Additional support for this
hypothesis can be found as well in the psychometric literature (cf.
Rosenberg, Izard, and Hollander, 1955).

All other responses appearing in Table 2 can be categorized as those
who are rationally consistent⁵ on both Likert questions, or those who,
if they understood the question, were inconsistent, given that they
accurately selected the “?” on the polar format.

⁵ For example, the “happy/sad” polar item would have been translated
into the following Likert format statements:

I feel most people are generally happy.    A a ? d D
I feel most people are generally sad.      A a ? d D

Persons who are consistent are the type of people who, when choosing
“?” on the polar choice, will agree with one Likert question as well
as agreeing with the other, the “polar opposite.” Six percent of total
responses fell in this category. Another possible response pattern for
“consistent respondents” is disagreeing with both Likert questions.
Almost one-fourth of the responses fall into this category. This
discrepancy found between these two percentages may be due to the
specific format under which the polar choice method is developed.
This assumption seems supported by the fact that the number of
disagreements with each Likert item was higher than the number of
agreements (44.5% versus 21.3% and 45.4% versus 24.9%), as can be
seen in the marginals in Table 2. Were the polar choice format
constructed with negative polar opposites instead of positive opposites
(i.e., respondents would be asked to disagree with one of the two polar
statements), the reverse trend would probably occur, i.e., more people
would be consistent in agreeing rather than consistent in disagreeing.

All persons falling in a “?” category on either Likert item can be
considered as rational if their reason for selecting a “?” on one Likert
scale is stronger than their reason for selecting another response
category on the other Likert item. Otherwise, they can be considered
as irrational responders. Since “Strongly Agree” represents a firmer
commitment than “Agree,” it seems that those persons who strongly
agreed with either Likert should not have used a “?” in the first place
(unless their reasons for selecting the “?” on the other Likert are very
strong) and probably fall into the category of “irrational responders.”
The same argumentation can be used to tentatively classify those who
used an “a” on one Likert and a “?” on the other as “rational.”

The final category is that of the inconsistent respondents, who are
those who agree with one Likert and disagree with the other: they
should not have selected the “?” on the polar item in the first place.

Table 3 presents a more concise summary of the basic types of
responders. As can be seen, just under 40% select the question mark
for the reason most researchers assume, and the remaining sixty percent
are split between being consistent and being what is called here
irrational or inconsistent.


TABLE 3

Item Response   Item Response                               Percentage of
on Likert 1     on Likert 2     Respondent Segment          all responses

?               ?               truly undecided and/or          17.7%
                                response stylers
?               a or d          rational respondents            21.2%
a or d          ?
A or a          a or A          consistent respondents           5.9%
D or d          D or d                                          24.1%
A or D          ?               irrational respondents           7.3%
?               A or D
A or a          d or D          inconsistent respondents        11.3%
D or d          a or A                                          11.6%

(Irrational and inconsistent respondents combined: 30.2%)
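
The segments of Table 3 amount to a small decision rule, which can be sketched as follows. This is an illustrative reading of the table, not code from the study: the function name `classify_pair` and the shortened segment labels are hypothetical, and the rule applies only to respondents who marked “?” on the corresponding polar item.

```python
AGREE = {"A", "a"}
DISAGREE = {"D", "d"}
EXTREME = {"A", "D"}  # "Strongly Agree" and "Strongly Disagree"


def classify_pair(likert_1: str, likert_2: str) -> str:
    """Assign a respondent segment from the two Likert responses.

    Assumes the respondent chose "?" on the polar item; responses use
    the Likert codes A, a, ?, d, D.
    """
    pair = (likert_1, likert_2)
    if pair == ("?", "?"):
        return "truly undecided and/or response styler"
    if "?" in pair:
        other = likert_2 if likert_1 == "?" else likert_1
        # A "?" paired with a strong commitment is treated as irrational.
        return "irrational" if other in EXTREME else "rational"
    both_agree = likert_1 in AGREE and likert_2 in AGREE
    both_disagree = likert_1 in DISAGREE and likert_2 in DISAGREE
    return "consistent" if both_agree or both_disagree else "inconsistent"


for pair in [("?", "a"), ("D", "?"), ("a", "A"), ("A", "d")]:
    print(pair, classify_pair(*pair))
```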


Analysis of Responses on Both Likert Questions Only 


Table 4 presents similar results for the two Likert questions. As can
be seen, the most popular response again is the “?” on the second
Likert, given its selection on the first. Although this is to be
expected, as mentioned above some of these responses may be due to the
presence of a “?” response style. The next most frequent categories are
those on either side of the question mark, and finally only a small
minority select the extreme responses. This, again, is generally
consistent with the scoring pattern employed by most researchers.


TABLE 4
Analysis of “?” Responses across Likert Questions

Responses on LIKERT A              Responses on LIKERT B
from “?” on LIKERT B               from “?” on LIKERT A

A          50       5.9%           A          90       8.8%
a         190      22.5%           a         250      24.8%
?         376      44.5%           ?         376      37.3%
d         160      18.9%           d         208      20.6%
D     [illegible]                  D     [illegible]
Column Totals  845  100.0%         Column Totals  1006  100.0%

Note. Numbers appearing in each cell represent the count and percent
of total (column) responses.


Discussion


The foregoing sections show how some understanding of the meaning
of the “?” response category can be achieved from an analysis of
responses across different item formats as well as items' average total
scores and response category variability. If further insight is desired,
it is suggested that researchers extend, for example, the study by
Goldberg (1971) in allowing respondents to state their reasons for
selecting the “?” (e.g., they would allow respondents to indicate
whether they are indifferent; have mixed feelings; the question is
ambiguous; the responses don't fit the question, etc., and in addition
in the polar format that they disagree or agree with both poles). Also,
it would be of interest to record respondents' reactions to being able,
and not being able, to select a middle category on a set of items (i.e.,
scale items would be presented under both five- and four-point formats).
Responses to items in both formats then would be compared as well as
respondents' reactions to the entire scale. It would also be interesting
to analyze the frequency of the use of the “?” when the number of
response categories is systematically varied.

Finally, the investigation of the existing relationship between
perceived knowledge or familiarity and interest in the area under
analysis and frequency of use of the “?” would provide additional
insight into the respondents' use of the question mark response category.

The results reported in this paper lead to the conclusion that the
variability of meanings of the “?” response is not greater than the
variability of the other response categories. Even though the current
practice of including a “?” and treating it as the middle category
seems supported, it does not mean that the “?” is only an indicator of
an average position, as most investigators assume. For example, some
evidence exists for the presence of a “?” response style and of
ambivalent or indifferent responses.

In conclusion, this study suggests that a researcher would be advised
to check the variability of his “?” responses against the other response
categories. This would indicate the range of meanings attached to the
“?”, as compared to the other response categories. Next, by looking at
the mean of the “?” response distribution, he would know what
appropriate score must be given to these responses. In other words, the
researcher should be more careful in specifying the meaning of the
category rather than just naively assuming that it indicates an average
position.

Despite its wide use in attitude scales, the question mark, or any
other response category for that matter, has received far less
theoretical consideration than needed if researchers want to validly
attribute meaning to responses to items in attitude scales. It is hoped
that this study represents a step in the analysis of the “?” responses
which will stimulate other researchers to pursue this direction, and
more generally will sensitize users of attitude scales to an important
methodological issue, that of determining empirically the meaning
respondents attribute to their responses to an item.


SECTION SELECTION IN MULTI-SECTION COURSES: 
IMPLICATIONS FOR THE VALIDATION AND USE OF 
TEACHER RATING FORMS 


LES LEVENTHAL, PHILIP C. ABRAMI, RAYMOND P. PERRY 
AND LAWRENCE J. BREEN 


University of Manitoba 


Researchers know little about determiners of section selection in
multi-section college courses. Studies on teacher evaluation and on
the validity of teacher rating forms have often assumed
section-to-section equivalence of students assigned by customary
registration procedures. To investigate the section selection process,
a questionnaire containing items on personal history, reasons for
section selection, and sources of information about the instructor was
administered to 1,188 undergraduate students in multi-section first
year and advanced psychology courses. Major findings were: (1) students
significantly differed across sections on biographical variables and on
section selection reasons, (2) time at which class was scheduled
(classtime) and teacher's reputation were the primary reasons for
section choice, (3) teacher's reputation was less important than
classtime for first year students, but comparable to classtime for
advanced students, and (4) reports from other students and published
ratings were, respectively, the first and second most frequent sources
of instructor reputation information.


CAMPBELL and Stanley (1963) 
researchers have analyzed studies that use a 
of students to classes as though random assignment had been employed,
and therefore often have reached incorrect conclusions about
treatment effects. Research into section selection and the equivalence 
of students in different sections is crucial to the literature on teaching 
effectiveness and on the validity of teacher rating forms (e.g., Costin, 
Greenough, and Menges, 1971). Researchers who have tried to vali- 
date a teacher rating form (TRF) frequently have assumed that dif- 
ferences among teachers on TRF ratings are due to teaching effective- 
ness rather than to initial student differences. They typically have 
studied multiple sections (taught by different teachers) of the same 
course and have computed the correlation between section means on 
a TRF and section means on validity criteria, such as a common
final examination. With few exceptions (Sullivan and Skanes, 1974), 
students have not been randomly assigned to sections. 

Leventhal (in press) has argued that performance differences among 
sections on validity criteria may be due to factors other than teacher 
ability such as student ability and motivation. Furthermore, Leventhal 
has suggested that even when teachers have no effect on validity 
criteria, a correlation between TRF means and validity criteria means, 
which may nevertheless occur because of the lack of randomization,
may mistakenly be used as evidence for TRF validity. In the present
investigation an attempt was made (a) to examine the section selection 
process to determine whether the process approximates random assign- 
ment and (b) to assess whether the process for advanced under- 
graduates differs from that for first year undergraduates. 


Method 
Subjects and Setting 


The subjects for this study were 940 students from 13 Introductory
Psychology sections and 248 advanced undergraduate students from 6
sections of Social Psychology. During the academic year prior to this
study, the University of Manitoba Student Union (UMSU) constructed,
administered, and analyzed a TRF for all departments in the
Faculty of Arts, which includes Psychology. The results of these
ratings were published and made widely available at the time students
registered for the courses investigated in this study.


Materials and Procedure 


A 22-item questionnaire was constructed in which the first eight
items related to the following student demographic characteristics: (1)
education, (2) age, (3) sex, (4) family income, (5) hometown
population size, (6) college grade-point average (GPA), (7) high school
GPA, and (8) an indication of whether or not the course was required. 
Items 9 to 16 assessed specific reasons for section choice, which in- 
cluded: (9) identifiability and importance of section choice reason 
(clarity of reason), (10) classtime (scheduled time), (11) desire to be 
with friends, (12) classroom location, (13) physical features of class- 
room location, (14) nature of assigned reading materials, (15) teaching 
ability and/or reputation as a teacher, and (16) any reason other than 
those listed. Response alternatives to section selection items ranged on 
a 4-point scale from “selection based entirely upon” the reason (one) 
to "selection not at all based upon" the reason (four). An “I don't 
remember" alternative was also provided. The remaining items, 17 to 
22, related to the following kinds of knowledge which respondents had 
about their instructor at the time of registration: (17) awareness by 
student of his/her instructor's published TRF evaluation, (18) infor- 
mation regarding whether or not the instructor had been evaluated by 
UMSU, (19) amount of pre-enrollment information about instructor's 
teaching ability and/or reputation, (20) source of information, (21) 
pre-enrollment conclusion about instructor, and (22) accuracy of pre- 
enrollment information. The questionnaire was administered three to 
four weeks into the regular session term. 


Results 


Reasons for Section Selection 


A mean and standard deviation for each course on items 9 to 15
were computed (see Table 1). The low means for item 9 show that
students in both courses generally maintained that they had a clear
reason for their section choice. Of the specifically named reasons (10 to
15), time of scheduling (classtime) and teacher's reputation were the
primary reasons for both courses. Although time of scheduling was
more important than reputation for Introductory Psychology students,
time was comparable to reputation in importance for Social
Psychology students. Furthermore, the rank ordering of importance of
the set of reasons was similar for both courses. For each of the
specifically named reasons, a one-way analysis of variance (ANOVA)
was computed. Results indicated that Social Psychology students
declared that they had more clearly identifiable reasons for section
selection than Introductory students (F = 13.96, df = 1/1168, p < .001).
None of the specific reasons significantly differed in importance for
the two courses except reputation of the professor. Social Psychology
students reported reputation to be more important to section selection
than did Introductory students (F = 43.3, df = 1/1172, p < .001).


TABLE 1
Introductory Psychology and Social Psychology Students
Compared on Section Selection Reasons

                                  n*                    M                SD
                          Intro.     Social     Intro.  Social   Intro.  Social
Item                      Psych.     Psych.     Psych.  Psych.   Psych.  Psych.     F      p
 9. Clear reason          940 (921)  248 (248)   2.27    1.96     1.17    1.06    13.96  <.001
10. Time                  940 (921)  248 (245)   2.89    2.75     1.09    1.00     3.21  [illegible]
11. Friends               940 (923)  248 (247)   3.83    3.89     0.53    0.41     3.34  [illegible]
12. Location              940 (925)  248 (247)   3.74    3.78     0.58    0.46     1.10  [illegible]
13. Room Features         940 (915)  248 (246)   3.97    4.00     0.22    0.06     2.97  [illegible]
14. Readings              940 (920)  248 (246)   3.77    3.72     0.53    0.53     1.34  [illegible]
15. Ability/Reputation    940 (928)  248 (245)   3.33    2.84     1.00    1.04    43.3   <.001

* Numbers in parentheses indicate number of responses analyzed after
defective responses and "I don't remember" responses were dropped.


To obtain a simple overall picture of the frequency with which
students in each course cited specific reasons to be of dominating
importance, for each course the percentage was computed of students
who had based their section selection (a) mostly or entirely, or (b) not
at all on that reason (Table 2). The data again indicate that two major
reasons, classtime and reputation, as well as a collection of minor
reasons apparently influence students' choice of sections. But even the
most potent reason, classtime for Introductory Psychology students or
reputation for Social Psychology students, mostly or entirely
influences the decisions of fewer than 40% of the students, and a
comparable percentage of students ignores the reason altogether. These
data, therefore, reveal that although reasons greatly differ in
importance, no single reason mostly or entirely determines the section
selection for the majority of students.


TABLE 2
Reasons Reported by Students for Section Selection

                                  % basing decision      % basing decision not at
                                  mostly or entirely     all on, or failing to
                                  on reason              recall, reason
                                  Intro.     Social      Intro.     Social
                                  Psych.     Psych.      Psych.     Psych.
Time of Class                     [row illegible; surviving residue: 253]
Teacher's ability/reputation      [row illegible; surviving residue: 385]
Readings                           3.8        3.2        81.5       76.1
Location                           4.8        1.2        80.1       79.8
Friends                            4.1        2.0        88.5       92.3
Room Features                      0.6        0.0        98.0       99.6

Note. Since students responding "somewhat" to the importance of a
reason were excluded, the percentages on a row for a course do not
total 100%.
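
The two-course comparisons reported in Table 1 can be roughly checked from the published summary statistics alone. The following is a minimal sketch, not the authors' computation: the function name `f_from_summary` is hypothetical, and it assumes that the parenthesized n values in Table 1 are the group sizes actually analyzed and that the SDs are sample standard deviations. Because the tabled means and SDs are rounded, the result is expected only to approximate the reported F of 13.96 for item 9.

```python
def f_from_summary(groups):
    """One-way ANOVA F ratio from per-group (n, mean, sd) summaries.

    Assumes each sd is a sample standard deviation (n - 1 denominator).
    Returns (F, df_between, df_within).
    """
    total_n = sum(n for n, _, _ in groups)
    grand_mean = sum(n * m for n, m, _ in groups) / total_n
    # Between-groups sum of squares: weighted squared deviations of group means.
    ss_between = sum(n * (m - grand_mean) ** 2 for n, m, _ in groups)
    # Within-groups sum of squares recovered from each group's variance.
    ss_within = sum((n - 1) * sd ** 2 for n, _, sd in groups)
    df_between = len(groups) - 1
    df_within = total_n - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    return f_ratio, df_between, df_within


# Item 9 (clear reason), Introductory vs. Social Psychology, values from Table 1.
clear_reason = [(921, 2.27, 1.17), (248, 1.96, 1.06)]
f_clear, df_b, df_w = f_from_summary(clear_reason)
print(f"F({df_b}, {df_w}) = {f_clear:.2f}")
```

The same function accepts more than two groups, so it could also be applied to section-by-section summaries if those were available.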


Approximation to Random Assignment 


To determine whether the section selection process for a course
approximates the random assignment of students to sections, analyses
were made of section selection and biographical items. Introductory
Psychology sections were analyzed separately from Social Psychology
sections. One-way analyses of variance (ANOVAs) were computed
across sections for each question (see Table 3) even though certain
questions have alternatives that do not meet the formal interval data
assumptions of the ANOVA (Burke, 1963). Taken as a whole, the data
in Table 3 show that the sections within a course significantly differed
from each other on both biographical dimensions and section selection
reasons. This outcome is especially clear in the analysis of the Social
Psychology course. Though only six Social Psychology sections were
tested, five out of the seven items that relate to reasons given by
students for choosing their sections (items 9 to 15) showed significant
differences across sections beyond the .05 level, and three out of seven
were significant beyond the .01 level. Furthermore, a high degree of
agreement between the two courses was shown on which of the seven
questions yielded significant section differences.


TABLE 3
Probabilities Associated with F Tests for Analyses of Variance Computed for Each
Question across Sections

                                      Introduction to
                                      Psychology         Social Psychology
Item                                  n = 940            n = 248
Demographic
  1. Education                        <.001              .053
  2. Age                              <.001              .324
  3. Sex                              <.001              [illegible]
  4. Family Income                    .681               [illegible]
  5. Hometown population              .001               [illegible]
  6. College GPA                      .616               .117
  7. High School GPA                  .049               [illegible]
  8. Course required                  <.001              .194
Section selection
  9. Clear reason                     .013               [illegible]
 10. Time                             .001               [illegible]
 11. Friends                          .641               [illegible]
 12. Location                         .001               [illegible]
 13. Room features                    .388               .550
 14. Readings                         .317               [illegible]
 15. Ability/reputation               <.001              [illegible]
 16. Another reason                   [illegible]        .080
Information on instructor
 17. Looked up TRF ratings            <.001              [illegible]
 18. Instructor previously rated      <.001              <.001
 19. Amount pre-enrollment
     information                      <.001              .027
 20. Source of information            .060               <.001
 21. Pre-enrollment conclusion        <.001              <.001
 22. Accuracy of information          <.001              [illegible]


Sources of Students’ Information 


Combining all students from both courses, students revealing that
they had had pre-enrollment information about their prospective
instructor's teaching ability or reputation were asked to indicate the
source of this information and their conclusion about the prospective
instructor. Cross-tabulation tables were prepared and translated into
the form of a chart that related sources of pre-enrollment information
and pre-enrollment conclusions about prospective professors (see
Figure 1). These data suggest that regardless of pre-enrollment conclusion
about the instructor, a student will retrospectively recall the following
as useful sources of pre-enrollment information (most frequently
recalled source first): comments from other students, UMSU booklet,
other sources, pre-enrollment audition, and previous experience with
the instructor. In addition, other students were seen as a far more
important source of information than were the remaining sources.
Finally, all sources of information generally appeared to be about
equally useful sources of favorable and unfavorable information with
the exception of “other students”: the more favorable a student’s
pre-enrollment conclusion, the more likely the student would maintain
that other students had been the source of useful information.


Accuracy of Information 


Analysis of the accuracy of pre-enrollment information item showed
that 31.0% stated that information was "totally accurate," that 46.5%
declared "mostly accurate," that 20.0% reported "somewhat accurate,"
and that 2.6% cited "completely wrong." To determine where
students in each of these designated response categories obtained their
information, cross-tabulation tables were prepared and then cast in
the form of a chart that related post-enrollment judgment about
accuracy of pre-enrollment information about their instructors (see
Figure 2). These data suggest that all sources of information were judged
accurate by some students and inaccurate by others. In addition, the
more a student judged his pre-enrollment information to be accurate,
the more likely he was to credit other students as an informative source
and the less likely he was to credit published UMSU ratings.

Figure 1. Sources of information about instructors claimed by students reaching
certain pre-enrollment conclusions about their instructor.

Analysis of responses to item 9 showed that about 80% of all
students reported that their reasons for selecting their section were
somewhat, mostly, or clearly identifiable. An analysis of item 16 showed
that about 72% declared that their section selection was not based, or
only somewhat based, upon a different reason from those reasons
listed in the questionnaire. In short, a large majority of students had a
specific reason for choosing their section, a reason that was listed in
the questionnaire. The final portion of item 16, which consisted of a
fill-in blank, requested students to describe any other reasons than
those already listed for choosing their section. Of all students tested,
72% ignored the fill-in blank, 12% reported that they had no alternative
other than to select the section, 10% apparently did not understand
item 16, and 6% listed a reason different from those other reasons in
the questionnaire.

Figure 2. Sources of information about instructors claimed by students reaching
certain post-enrollment judgments on the accuracy of that information.
(Vertical axis: source of information, Question 20, with categories other
students, UMSU booklets, other, audition, and previous experience.
Horizontal axis: post-enrollment judgment on accuracy of pre-enrollment
information, Question 22, with categories completely wrong, N = 19;
somewhat accurate, N = 152; mostly accurate, N = 212; totally accurate,
N = 195.)


A correlation coefficient between the mean importance of an item and
the probability of random section-to-section variation on that item
provides a conservative estimate of the degree to which responses to
the questionnaire are controlled by variables related to actual section
selection. Such a correlation was computed over items 10 to 15
between the mean importance of item for all sections combined, and the
probability of obtaining an F ratio at least as large as the one found
for section to section variation in section means. The correlation
coefficient was .64 for Introductory Psychology, .62 for Social
Psychology, and .73 for both courses combined.
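
The adequacy check described above pairs, for each of items 10 to 15, a mean importance rating with the probability attached to that item's across-section F test, and then correlates the two lists. A minimal sketch of that calculation follows. The input values are HYPOTHETICAL placeholders chosen only to make the example run (most of the item-level probabilities are not legible in this copy), and the helper name `pearson_r` is likewise an assumption rather than anything named in the study.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# HYPOTHETICAL data for six reasons (items 10 to 15), not the study's values.
# Importance is on the 1 (entirely) to 4 (not at all) scale, so a larger mean
# indicates a less important reason.
mean_importance = [2.9, 3.8, 3.7, 4.0, 3.8, 3.2]
p_across_sections = [0.001, 0.60, 0.01, 0.40, 0.30, 0.001]

r = pearson_r(mean_importance, p_across_sections)
print(f"r = {r:.2f}")
```

Under this scoring, a positive r means that the reasons students rated as less important were also the ones on which sections differed least, which is the pattern the check is looking for.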
Discussion


Section Selection Process


The results of this study show that classtime (scheduled time of
class) and teacher’s reputation are the first and second most important
variables controlling section choice. Although reputation lags in
importance considerably behind classtime for freshman students,
reputation is comparable to classtime for advanced undergraduates. The
instructor reputation data have implications for student evaluations of
teachers. Perry, Niemi, and Jones (1974) have found that students
more favorably rate highly reputed teachers than they rate poor ones.
The present data indicate that many students admitted seeking
reputation information prior to registration, a tendency that was stronger
with advanced undergraduates than with less advanced students. In
addition, since the present data reveal that students significantly varied
across sections in their statements of the importance of a teacher's
reputation, it appears that this declared reason was, in fact, relevant to
section choice. Thus, one may infer that teachers tend to be locked
into their reputations and that this tendency is stronger in advanced
courses than in introductory ones. More importantly, the present data
demonstrate that students significantly varied from section to section
along a number of biographical dimensions and according to their
reasons for section choice. In short, the assumption that the section
selection process is sufficiently complex to approximate random
assignment was not supported by this study.

Furthermore, it would appear that many of these factors may
influence either the TRF ratings by students or their performance on
TRF validity criteria. For this reason, failure to randomize students
may produce inaccurate estimates of teachers’ impact on student
performance on completing TRFs and on TRF validity criteria. The
present findings that students vary from section to section on more
than just one dimension and that no section selection reason mostly or
entirely influences the decisions of more than 40 per cent of the
students imply that the section selection process is controlled by many
important variables rather than by a single dominating one. Hence,
non-randomized TRF validation studies that use statistical control
techniques (e.g., part and partial correlation) to control for initial
student differences among sections must control many student
variables. Typically, these studies have controlled only student ability
(e.g., Elliott, 1950).


Sources of Students’ Information


The present data were collected at a university where published
TRF results were available to students during course registration. If
student differences across sections are primarily due to selection of
highly reputed teachers, and if interest in teacher reputations is due to
the availability of published ratings, then the present data may be
representative only of institutions that publish TRF evaluations. On
the other hand, if students acquire information about teachers from
other sources (e.g., other students), then the present data may have
wider generality than they would otherwise have. The present
investigation shows that other students were the most frequent source of
information about instructors; hence, the section selection processes
identified in this study may also occur at institutions where published
TRF results are unavailable.

Generally, all sources of information except “other students,” were 
remembered to be equally useful sources of favorable and unfavorable 
knowledge for decision-making. Hence, there is no strong evidence 
that the availability of published TRF results would greatly change
the balance of favorable and unfavorable information available to
students.


Cautions in Interpreting Results: Implications for Valid Use of a TRF 


The checks on the adequacy of the questionnaire indicate that the
questionnaire did not omit any important reasons for section choice
and, more importantly, that responses to the section choice items
reflected the actual selection process. Nevertheless, the present data
should be interpreted with caution so that their validity may not be
misrepresented. First, these data are correlational; causal
interpretations must be made with care. Second, all data are retrospective
reports vulnerable to distortions and colorations. For example, a
source remembered as providing useful information may not have
influenced section selection. Third, many of the present findings
achieve practical significance, not because the variables studied were
shown to be of dominating importance to section selection, but
because they are sufficiently potent to provide re-interpretations of
existing research which typically involves large groups and uses powerful
statistical techniques. For example, only 21% of Introductory
Psychology students studied maintained that they had based their section
choice mostly or entirely on a teacher's reputation (see Table 2).
Nevertheless, such a percentage may produce significant section to
section differences in student characteristics associated with interest in
teacher's reputation when the differences are computed in a study
employing nonrandom assignment to sections of more than 20 students.
For the various reasons cited, the appropriate use of the results
obtained from a TRF may be difficult to ensure in specific college and
university settings.


ESTEEM CONSTRUCT GENERALITY AND ACADEMIC
PERFORMANCE


C. KENNETH SIMPSON AND DAVID BOYLE


Cleveland State University 


Measures of global, specific, and task-specific self-esteem were 
administered to 78 male and 81 female college students and related to 
predicted and actual performance on a midterm examination.
Significant correlations were found between global and specific measures
and between specific and task measures, but not between global and 
task measures. The relationship between the esteem measures and 
actual performance was strongest for the task measures, next strong- 
est for the specific measures, and nonsignificant for the global meas- 
ures. Specific measures were also significantly related to predicted 
performance, but global measures were not. The findings were dis- 
cussed in terms of four criticisms of global measures, and it was 
suggested that more specific self-esteem measures be developed. 


THE three major self-esteem constructs identified in the literature 
thus far can be seen as representing different levels of generality in an 
hierarchy of esteem constructs. Global self-esteem, usually defined as 
an individual's evaluation of his overall worth as a person, is assumed 
to be the weighted function of esteem in more specific areas. Specific 
self-esteem refers to evaluations either made in certain life situations 
(social interaction, male-female relations, education, work) or based 
on particular aspects of the individual (physique, intelligence,
personality, interpersonal competence). But each of these sources is still
rather broad, since it includes a multitude of different behaviors and
situations. Task-specific or situational self-esteem refers to evaluations
of more restricted sets of behaviors in specific situations. This con- 
straint can be conceptualized in one way as the expectations by the 
individual of his performance in a task-specific situation. 

In two experimental studies the relationship between both global 
and specific self-esteem measures and behavior has been compared.
Schrauger (1972) employed a global measure and a measure of task- 
specific self-esteem to study the behavior of high and low esteem 
subjects performing a concept formation task alone or in the presence 
of an audience. Both measures were related significantly to subjects’ 
perceptions of performance and confidence ratings, but only the task- 
specific measure was related to actual performance. Morrison, 
Thomas, and Weaver (1973) found that both the global measure and a
specific measure (school esteem) from Coopersmith's Self-Esteem In- 
ventory (SEI), but not the global Social Self-Esteem measure of Ziller, 
Hagey, Smith, and Long (1969), were significantly correlated with 
both predicted and actual performance on a midterm examination. 
However, when predictions of performance made after the test were 
adjusted through using actual grades as the covariate, only the global 
SEI measure produced significant results: predicted scores were signifi- 
cantly higher for high than for low esteem subjects. 

The major purpose of the present study was to examine the relation- 
ships between self-esteem measures at different levels of generality and 
academic performance. The first two-fold hypothesis was that global 
self-esteem (GSE) measures would be positively related to specific self- 
esteem (spSE) measures and that specific measures would be positively 
related to task-specific self-esteem (tsSE) measures; it was unclear 
whether or not GSE measures would be related to tsSE measures.
Second, it was hypothesized that tsSE measures would be most closely 
related to scores on a midterm examination, that spSE measures 
would be less closely related, and that GSE measures would have the 
lowest degree of relationship, if any. Third, it was predicted that high 
esteem subjects would receive significantly higher grades than would 
moderate esteem subjects and that moderate esteem subjects would 
receive significantly higher grades than would low esteem subjects. 
Finally, it was hypothesized that spSE measures would have a stronger 
relationship than GSE measures to predicted performance. 


Method 


Three different measures were used to assess global self-esteem. The 
first measure was the total score on the Tennessee Self-Concept Scale 
(TSCS) (Fitts, 1965) which is based on 90 items summed across three
internal frames of reference (identity, self-satisfaction, and behavior)
each in relation to five aspects of the self (physical, moral-ethical,
personal, familial, and social self). The second measure was
Rosenberg's (1965) self-esteem test (RbSE) composed of 10 items that form a
Guttman scale. The third measure was a single item (QSE) constructed 
for this study on which subjects rate their global self-esteem on a 10- 
point scale after comparing themselves to descriptions of high and low 
esteem persons provided as anchor points. 

Two measures of specific self-esteem were also developed since nei- 
ther the TSCS nor the RbSE has subscales relevant to academic per- 
formance. A rating of intellectual esteem (inE) was obtained by asking 
subjects, "Generally, how high is that part of your esteem which is 
based on your assessment and evaluation of your intellectual abili- 
ties?" This was judged to represent the complex set of skills most 
nearly pertinent to academic performance. A rating of educational
esteem (edE) was obtained by asking subjects, “Generally, how high is 
your esteem in academic-educational situations (in your classes and 
other situations directly related to your education)?" This question 
was judged to represent the general area in which academic perform- 
ance would fall. Ratings for both measures were made on a 10-point 
scale. Pilot work done with these two measures and on the QSE
yielded two-month, test-retest reliabilities of .84, .81, and .77, respec- 
tively. 

Two measures of task-specific self-esteem were obtained from pre- 
dictions by subjects of their performance on a midterm examination. 
Before the test began, subjects were asked to estimate the numerical 
score they expected to receive on a standard academic scale (A = 90-100,
B = 80-89, etc.). These estimates along with subjects’ grade point
averages (GPA) were collected before the tests were distributed. When
the examination was over, subjects were again asked to predict their
grades. These two estimates were designated task-esteem-before (tE-b)
and task-esteem-after (tE-a).

Subjects were 78 male and 81 female students enrolled in a
sophomore-level psychology course who completed all measures for the
study. The TSCS and RbSE scales were administered in class during
the third and fourth weeks of the quarter; the nine-item Self-Esteem
Questionnaire was completed during the fifth week. The midterm
examination, the first of two tests in the course, was given the sixth
week during which time the two task-specific measures were obtained.
The test consisted of 20 short-answer essay questions covering
lecture material and textbook reading assigned for the first half of the
course. Raw scores were rescaled to conform to a standard academic
scale and were used along with scores predicted before the test (the
tE-b measure) as the dependent measures.
Independent t tests performed on all male-female comparisons
revealed no significant differences for any of the nine measures except
midterm grade: females did receive significantly higher scores than did
males, t (157) = 3.28, p < .01. To check further for any sex differences,
product-moment correlations were computed among all variables for
each sex separately. Inspection of the two matrices revealed no major
differences in either magnitude or pattern of the correlations; hence
the data were combined for the remaining analyses.

The correlation matrix for all subjects was factor analyzed through
using the principal components method with a varimax rotation, and
two factors accounting for 52.7% and 47.3% of the variance were
extracted. Table 1 presents intercorrelations and factor loadings for
each of the nine measures. The two tsSE measures and the two
academic measures had the highest primary loadings on the first factor;
hence it was labeled Academic Performance. The high loadings of the
three GSE measures clearly identified the second factor as Global
Self-Esteem. The two spSE measures loaded about equally on these two
factors.

Examination of Table 1 indicates that the patterns of relationships
among the various esteem measures were as predicted. The mean
correlation between the three GSE measures (.56) was similar to the
correlation between the two spSE measures (.66) and to that between
the two tsSE measures (.63), but slightly higher than the correlation
between grade and GPA (.49). Correlations of measures within a
subgroup with measures in other subgroups were all of about the same
magnitude. Low but significant correlations were found between GSE
and spSE measures (r = .28) as well as between spSE and tsSE
measures (r = .30); however, the GSE measures were only marginally
related to the tsSE measures (r = .13). Between each of the global,
specific, and task-specific measures and the two academic measures the
means of each set of resulting correlations were -.02, .26, and .45,
respectively. The only substantial difference in correlational indices
involving self-esteem measures within a subgroup was the higher
correlation of the tE-a measure with grade in comparison with the
correlation between tE-b and grade (.54 vs. .41).

The distribution of scores for each esteem measure was divided into
thirds to form high, moderate, and low esteem groups. Subjects whose
scores fell on the dividing lines were randomly assigned to the
appropriate group so as to assure equal size groups. Table 2 presents mean
predicted grades and mean grades for each of these three esteem
groups by esteem measure. A one-way analysis of variance performed
TABLE 1
Intercorrelations and Factor Loadings for Esteem and Academic Measures

                              Intercorrelation                     Factor Loading
Measure         2     3     4     5     6     7     8     9          I      II
Global
  1. TSCS      .59   .50   .26   .30   .17   .04   .03   .00        .00    .79
  2. RbSE            .58   .23   .26   .17   .10   .01   .00        .00    .81
  3. QSE                   .27   .31   .13   .16   [illegible]      [illegible]  .82
Specific
  4. inE                         .66   .21   .27   .24   .30        .47    .55
  5. edE                               .37   [remaining entries illegible]
Task
  6. tE-b                                    .63   .41   .43        .74    .17
  7. tE-a                                          .54   .42        .80    .09
Academic
  8. grade                                               .49        .76   -.10
  9. GPA                                                            .75   -.06


Note. For n = 159, p < .05 for r = .16; p < .01 for r = .21.
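
The significance thresholds in the table note can be checked with the usual conversion of a correlation to a t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2). The sketch below is an illustration under stated assumptions rather than the authors' procedure: the function names are hypothetical, the test is taken to be two-tailed, and the t distribution with 157 degrees of freedom is approximated by the standard normal, which slightly understates p but is close at this sample size.

```python
from math import sqrt
from statistics import NormalDist


def r_to_t(r, n):
    """t statistic for testing a Pearson r against zero with n pairs."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


def approx_two_tailed_p(r, n):
    """Approximate two-tailed p value, using a normal approximation to t.

    The approximation is an assumption made to keep the sketch free of
    third-party dependencies; it is reasonable only for large n.
    """
    t = abs(r_to_t(r, n))
    return 2 * (1 - NormalDist().cdf(t))


n_subjects = 159
for r in (0.13, 0.16, 0.21):
    t = r_to_t(r, n_subjects)
    p = approx_two_tailed_p(r, n_subjects)
    print(f"r = {r:.2f}: t = {t:.2f}, approximate p = {p:.3f}")
```

Under these assumptions r = .16 falls just below the .05 level and r = .21 just below the .01 level, while the mean GSE-tsSE correlation of .13 does not reach .05.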


on the midterm scores of the three groups revealed no significant
effects for any of the GSE measures and only a marginally significant
effect for the inE measure. However, significant effects were obtained
for the edE, tE-b, and tE-a measures. A post-hoc comparison of means
using Newman-Keuls tests (Winer, 1971) showed that the high esteem
group did obtain significantly higher scores on the edE measure than
did the moderate esteem group, q (3, 156) = 3.45, p < .05, or than did
the low esteem group, q (2, 156) = 2.82, p < .05. For the tE-b measure,
high esteem subjects did score significantly higher than did low esteem


TABLE 2
Mean Predicted Grades and Mean Grades of High, Moderate, and Low
Esteem Groups by Esteem Measure

                    Mean Predicted Grade                     Mean Grade
                       Esteem Group                         Esteem Group
Esteem
Measure     High   Moderate   Low    F             High          Moderate   Low           F
TSCS        82.8   81.0       81.5   [illegible]   [illegible]   [illegible] [illegible]  [illegible]
RbSE        83.2   81.5       80.5   [illegible]   72.5          70.9       [illegible]   [illegible]
QSE         83.3   81.5       81.6   [illegible]   70.3          70.9       72.8          [illegible]
inE         84.1   80.6       80.2   [illegible]   [illegible]   70.0       69.8          [illegible]
edE         83.7   82.0       79.6   [illegible]   [illegible]   69.6       70.4          [illegible]
tE-b                                               74.7          71.8       66.9          [illegible]
tE-a                                               76.4          70.3       66.1          25.1

[Table footnotes illegible.]
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subjects, 4 (3, 156) = 7.03, p < .01, and the moderate esteem subjects 
did score significantly higher than did the low esteem subjects, q (2, 
156) = 4.39, p < .01. For the tE-a measure, the high esteem group 
scored significantly higher than both the low esteem sample, q (3, 156) 
= 10.02, p < .01, and the moderate esteem sample, 4 (2, 156) = 5.88, p 
< .01, whereas the moderate esteem group did register significantly 
higher scores than did the low esteem group, 4 (2, 156) = 4.14, p < .01. 

A one-way analysis of variance performed on the predicted midterm
scores of the high, moderate, and low esteem groups revealed
significant effects for both spSE measures, but not for any of the GSE
measures. For the inE measure, high esteem subjects did yield
significantly higher predicted grades than did the moderate esteem subjects,
q (2, 156) = 3.70, p < .05, or than did low esteem subjects, q (3, 156) =
4.10, p < .01. For the edE measure, higher grades were predicted for
high esteem subjects than for the low esteem subjects, q (3, 156) =
4.29, p < .01.


Discussion 


The results of this study supported all four hypotheses. Significant 
correlations were found between global and specific measures and 
between specific and task-specific measures, but correlations between 
global and task measures were only of marginal significance. Task- 
specific measures did show a stronger relationship to academic per- 
formance than did specific measures, which in turn had a stronger
relationship to grades than did global measures. Although these
findings are in agreement with Schrauger (1972), they point even more
clearly to the predictive power of the task-specific measures. They also
corroborate many of the relationships between variables found by
Morrison et al. (1973) with one important exception—global self- 
esteem was unrelated both to predictions of performance and to actual 
performance. 

To account for these findings it is necessary to examine the scales 
used to measure the different self-esteem constructs. At least four 
important criticisms have been made of global measures. First, Gergen 
(1971) has pointed out that global measures tend to overlook
important situational influences. When a global measure such as the
Self-Esteem Inventory (Coopersmith, 1967) represents the specific area
pertinent to the behavior under study, then a relationship between
global self-esteem and behavior may be obtained as in Morrison et al.
(1973). However, when an instrument such as the Tennessee Self
Concept Scale does not represent a particular area or when it is not
possible to determine how much subjects consider a particular area in
the completion of generally stated items such as those on the RbSE or
QSE scales used in this study, significant relationships may not be
obtained.

A second issue concerns the weighting of different sources of self-
esteem (Rosenberg, 1965). One person may base his global self-esteem 
more on intellectual-academic achievements, whereas another may 
base it more on social or characterological factors. Most global meas- 
ures, however, do not take these differences into account. The TSCS 
weights each of five areas equally, whereas differential weights on the 
RbSE and QSE scales are unknown, since subjects have the freedom to 
determine weights themselves. 

A third difficulty stems from the fact that the global self-esteem 
construct is so all-encompassing that it is hard to know what is being 
measured (Wylie, 1974). Global measures may represent literally thou- 
sands of behaviors in a wide variety of situations, and yet these behav- 
iors, as well as subjects’ values, standards, reference groups, and eval- 
uative criteria, are often unknown. These and other problems of 
measurement described by Wylie (1974) decrease the predictive power 
of many global measures. 

In contrast, these criticisms do not apply so much to specific and 
task-specific measures as to global measures. Specific measures as 
contrasted with global measures can focus upon more restricted areas 
of functioning or categories of behavior. However, these specific meas- 
ures are still rather broad because they represent, theoretically, the 
weighted sum of evaluations for hundreds of different behaviors and
situations. Furthermore, the situational referents, behaviors, and
weights for specific measures are unknown. Task-specific measures, on 
the other hand, can focus upon specific behaviors in a specific situa- 
tion. Thus, when esteem level is related to specific behaviors, it is easy 
to see how more specific measures would yield higher correlations than 
would global measures. 

A fourth criticism of self-esteem measures may also apply more to
global measures than to specific measures. Most individuals are
motivated to evaluate themselves positively (Rosenberg, 1965), a tendency
which operates to bias their self-reports (Ziller et al., 1969). When
subjects describe or evaluate themselves in a general setting or fashion
rather than in a specific situation or in relation to specific
behaviors, they may be more likely to present themselves in a socially
desirable, positive, or idealistic manner. However, when the
evaluation process is restricted or when subjects are confronted with their
own behavior in a specific situation, they may be less likely to furnish
socially desirable self-reports than they would in the global
circumstances. This positive distortion in ratings may be one factor that
accounts in part for the findings that global self-esteem is related to
predictions of performance, but not to actual performance.

The findings of this study again question the utility of the global 
self-esteem construct and underscore the importance of several sugges- 
tions made by Wylie (1974). It is imperative that researchers first 
clearly define more specific kinds of self-esteem and then, making full 
use of the methodological expertise available, develop instruments to 
measure them. Global self-esteem measures should be developed or 
refined so that in representing major sources of self-esteem they may 
be weighted in terms of their importance to the subject. The construct 
validity of these instruments must then be established. Researchers 
using self-esteem measures would be well-advised to select the measure 
most nearly appropriate for their study: global measures may be
preferable in some cases, but specific or task-specific measures may be of
greater value in others.


THE VALIDITY OF SOME ALTERNATIVE MEASURES OF
ACHIEVEMENT MOTIVATION 


FRANK B. W. HARPER 


University of Western Ontario, Canada 


Two tests commonly used in research on achievement motivation, 
the Thematic Apperception Test and the Test Anxiety Question- 
naire, have been criticized for a number of shortcomings involving 
reliability, validity, and ease of scoring. The present study examines 
the retrospective validity of two alternate measures which appear to 
overcome many of the objections to the former tests. These alternate 
measures are the n-Ach scale of the Personality Research Form and 
the Debilitating Anxiety scale of the Achievement Anxiety Test. An 
analysis of variance comparing the relative academic achievement of
two samples of college students was performed, using comparisons of
high and low scorers on the two measures. The results
showed that academic achievement was significantly related to scores
on the tests. The alternate measures are therefore recommended to
researchers for further study of achievement motivation.


ACHIEVEMENT motivation theory as formulated by Atkinson (1964)
has used the need achievement score (n-Ach) derived from the
Thematic Apperception Test (TAT) as the principal operational measure
of the motive for success Ms, and the Test Anxiety Questionnaire
(TAQ) as the principal operational measure of the motive to avoid
failure Maf.

Clarke’s (1973) critique of the use of the TAT as a measure of need
achievement and presumably therefore of Ms, and Harper’s (1971,
1974) critique of the TAQ as a measure of test anxiety and, therefore,
of Maf, raise questions about the reliability and validity of these
measures. Alternative measures of need achievement which meet more
stringent criteria of reliability than does the TAT are recommended by 
Clarke. In particular he has suggested the use of the need achievement 
scale (n-Ach) of the Personality Research Form (PRF) (Jackson,
1967), as a more suitable instrument. The PRF was designed to
minimize the effects of social desirability and acquiescence response sets in
self-report personality inventories. Jackson reported reliabilities for 
the n-Ach scale of .72 to .86. A high scorer on this scale is described as 
an individual who aspires to accomplish difficult tasks, maintains high 
standards, and is willing to work towards distant goals; responds 
positively to competition; is willing to put forth effort to attain
excellence. Correspondingly, a low scorer on this scale is described in
opposite terms.

Harper's review of the comparative validity of the two principal
measuring instruments for test anxiety, the TAQ and the Achievement
Anxiety Test (AAT) (Alpert and Haber, 1960), recommended that
when anxiety about taking conventional college examinations or tests
was the focus of the researcher's interests, the Debilitating scale (Deb)
of the AAT be used (Harper, 1974).


Problem


It was the purpose of this study to ascertain the retrospective
validity of one formulation of the achievement motivation hypothesis in a
college population through the use of these two operational measures,
the n-Ach scale of the PRF and the Deb scale. In reviewing
Spielberger’s work on the relationship of Manifest Anxiety to grade point
average (GPA) (Spielberger, 1962; Spielberger and Katzenmeyer,
1959), Atkinson (1964, p. 255) suggested that groups formed from
individuals high or low in score distributions of n-Ach measures or
anxiety measures should differ significantly in their academic
attainment as reflected in their overall grade point averages. Weiner (1972,
pp. 195-209) presented an hypothesis derived from Atkinson’s
analysis. Weiner developed the formal conceptual terms of achievement
theory as follows: when the motive for success (Ms), which is
operationalized as a score on a n-Ach measure, is greater than the motive to
avoid failure (Maf), which is operationalized as a score on a measure
of test anxiety, the individual should approach achievement-related
activities such as college examinations in a positive way. Conversely,
when Ms is less than Maf, the individual should be more hesitant
about approaching achievement-related activities and consequently
should be less proficient in examinations.

The more sophisticated versions of achievement motivation theory
add to this formulation additional variables involving subjective
estimates of the probability of success and the incentive value of success
which a retrospective study of this kind cannot calculate. The theory is
therefore tested in its simplest form as a measure of the possible effect
on overall career grade point average of the relationship of Ms to Maf.


Subjects and Procedure 


Two samples of college students were studied. One was a sample of 
304 male graduates and the other a sample of 332 women graduates. 
Both samples were enrolled in a post-graduate teacher preparation 
program, after having completed the Bachelor’s degree in an academic 
field. Each subject was given the PRF and the AAT to complete at the 
beginning of the teacher preparation year. The overall career grade 
point averages for the samples were calculated from their respective 
college transcripts. 


Scoring and Analysis 


The subjects were assigned to one of four cells in a 2 X 2 ANOVA
design, according to their scores on the two scales, n-Ach and Deb.
Subjects scoring higher than the median on both scales were assigned
to the High-High cell. Subjects scoring lower than the median on both
scales were assigned to the Low-Low cell. Subjects scoring above the
median in n-Ach but below the median in Deb were assigned to the
High-Low cell. Finally, subjects scoring below the median in n-Ach
and above the median in Deb were assigned to the Low-High cell. The
grade point averages for each subject were then entered into the 
ANOVA as the dependent variable, and the corresponding F values 
calculated in a conventional 2 X 2 design. 
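
The cell assignment described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names `n_ach`, `deb`, and `gpa` are hypothetical, and subjects scoring exactly at a median are dropped here because the text speaks only of scores above or below the median and does not say how ties were handled.

```python
from statistics import mean, median


def assign_cells(subjects):
    """Group GPAs into the four median-split cells.

    `subjects` is a list of dicts with hypothetical keys
    'n_ach', 'deb' and 'gpa'. Cell keys are (n-Ach level, Deb level),
    so ("High", "Low") is the High-Low cell of the text.
    """
    med_ach = median(s["n_ach"] for s in subjects)
    med_deb = median(s["deb"] for s in subjects)
    cells = {}
    for s in subjects:
        # Assumption: scores tied with a median are left out.
        if s["n_ach"] == med_ach or s["deb"] == med_deb:
            continue
        ach = "High" if s["n_ach"] > med_ach else "Low"
        deb = "High" if s["deb"] > med_deb else "Low"
        cells.setdefault((ach, deb), []).append(s["gpa"])
    return cells


def cell_means(cells):
    """Mean GPA for each cell, the quantity tabulated per cell."""
    return {cell: mean(gpas) for cell, gpas in cells.items()}
```

The resulting four groups of GPA values are what a conventional 2 X 2 analysis of variance would then take as input.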


Results 

The grade point averages for each cell of the 2 X 2 distribution are 
shown in Table 1, for each sex separately. 

The GPA values follow the sequence predicted by the research 
hypothesis. Subjects in the High Success (n-Ach) and Low Test
Anxiety (Deb) cell had the highest GPA, whereas subjects in the Low
Success (n-Ach) and High Test Anxiety (Deb) cell had the lowest GPA.
Subjects in the other two cells had GPAs which were intermediate in
value between these two extremes.

A 2 X 2 ANOVA was performed on the table. The values of F are
shown in Table 2. It can be seen in both samples that the values of F
were significant for the main effects of n-Ach and Deb. For the male
sample a significant interaction effect did exist, whereas for the
women’s sample a significant interaction was not present.


TABLE 1
Grade Point Averages in Groups Differing in Four Combinations of
n-Ach and Test Anxiety

                                      Success (n-Ach measure)
                                        High          Low
Males (N = 304)
  Anxiety (Deb measure)     High       65.38         64.59
                            Low        68.16         65.00
Women (N = 332)
  Anxiety (Deb measure)     High       68.48         66.34
                            Low        69.23         67.89


Discussion


The study was intended to ascertain the retrospective validity of two
alternative measures of achievement motivation. The results show that
in both samples, the alternate measures had significant relationships
with career grade point average. Both the n-Ach scale of the PRF and
the Deb scale of the AAT appear therefore to be worthy of further
study as operational measures of the constructs Ms and Maf,
respectively. The ease of administration and scoring of these two measures
compared to the difficulties in administering and scoring the TAT and
the TAQ should recommend the use of them to researchers.


TABLE 2
Analysis of Variance of Grade Point Averages in Groups
Differentiated in Achievement Motivation

Source                     df     Mean Square       F        p
Males
  Test Anxiety (Maf)        1       192.64         9.07    <.01
  N-Ach (Ms)                1       296.05        13.95    <.01
  Interaction               1       105.68         5.02    <.05
  Within                  300        21.22
  Total                   303
Women
  Test Anxiety (Maf)        1       109.88         4.85    <.05
  N-Ach (Ms)                1       251.56        11.10    <.01
  Interaction               1        13.52         0.60     ns
  Within                  328        22.67
  Total                   331


RELATIONSHIPS AMONG FOUR MEASURES OF
ACHIEVEMENT MOTIVATION 


THOMAS R. WOTRUBA 
Department of Marketing 
San Diego State University 


KARL F. PRICE 
Department of Management 
Temple University 


Short, objective tests of achievement motivation have considerable 
potential appeal to investigators for reasons of economy and ease of 
analysis. Two new paper-and-pencil tests of achievement motivation 
developed by Hermans and Mehrabian were examined to determine 
whether they might be comparable to two older measures, McClel- 
land's TAT n-Ach and the achievement scale of the Edwards Per- 
sonal Preference Schedule. The four measures were administered to 
65 undergraduate business administration students at San Diego 
State University. Although the results reflected a modest (.30)
correlation between the Hermans and the TAT n-Ach, no other significant
correlations among pairs of the four achievement measures were 
found. The results lend support to past findings; namely, that the 
various achievement measures would appear to be measuring dis- 
similar constructs. 

THE purpose of the study was to investigate whether significant 
relationships exist among four measures of need achievement. Two of
these, McClelland’s projective measure and the achievement (ach) 
scale of the Edwards Personal Preference Schedule (EPPS), have been 
subject to considerable previous study. The other two measures have 
been obtained from relatively new paper-and-pencil tests of
achievement motivation. Developed by Mehrabian (1968, 1969), the first one
includes “verbal items which are designed to discriminate high versus
low achievers” (1968, p. 494). Separate male and female scales were
devised, each with 26 items to be rated on a 9-point measure of
agreement. Developed by Hermans (1970), the second one consists of
29 multiple choice items representing various aspects of achievement
motivation found in the literature.

Copyright © 1975 by Frederic Kuder

Short, objective tests of achievement motivation have considerable 
potential appeal to investigators because of relative economy and ease 
of analysis. However, numerous studies of the relationships between 
various measures of achievement motivation have produced little in 
the way of positive relationships. For example, the achievement score 
(n-Ach) of McClelland's Thematic Apperception Test (TAT) showed 
no significant relationship with three objective tests: Survey of Study
Habits and Attitudes (SSHA), Opinion, Attitude, and Interest Survey
(OAIS), and a 99 item How To Study test (HTS) (Krumboltz and
Farquhar, 1957). In another study, the same TAT measure of n-Ach
showed no significant correlation with self-reports or with self-peer
ranking measures (Holmes and Taylor, 1968). A number of studies
have failed to produce a significant correlation between the TAT
n-Ach score and the achievement scale measure on the EPPS
(Himelstein, Eschenbach, and Carp, 1958; Marlowe, 1959; and Melikian,
1958).

Thus, in addition to replicating previous studies concerning
relationships between McClelland's n-Ach scores and the EPPS achievement
scales, the present study adds information from two recently devised
instruments to the pool of correlation data on achievement tests.


Procedure 


The subjects were 65 undergraduate business administration stu- 
dents at San Diego State University. The McClelland TAT was admin- 
istered first; the EPPS one week later; and the two scales by Mehrabian 
and Hermans, about a week later, two days apart. Preceding all the 
tests, a demographic questionnaire was completed by each participant. 
The TAT was scored by both authors (Atkinson, 1958), with an
interrater reliability of .88. The other three tests were scored according
to procedures devised by each test’s author (Edwards, 1959; Hermans,
1970; and Mehrabian, 1968, 1969).


Results 


The results, as shown in Table 1, reflect a modest but significant
correlation between the TAT n-Ach and Hermans achievement
measures. None of the other five pairs of achievement measures produced a
degree of relationship significant at the .05 level, although the
correlation between the Hermans and Mehrabian measures could occur by
chance at the .10 level. Nevertheless, both the Hermans and the
Mehrabian measures correlated significantly with at least three of the EPPS
variables other than achievement. Both shared a significant common
variance with dominance (dom), and either one or the other showed a
significant negative correlation with order (ord), succorance (suc),
and affiliation (aff), and a significant positive correlation with
endurance (end).


TABLE 1
Correlations between Study Variables

EPPS           TAT measure
variables       of n-Ach       Hermans      Mehrabian
ach              -.169          .219          .215
def              -.018         -.149         -.010
ord               .076?         .090         -.364**
exh              -.123          .001?         .026
aut               .200         -.192         -.094
aff               .059         -.262*        -.029
int               .020          .082          .108
suc               .072?        -.238         -.391**
dom              -.127          .291*         .343**
aba               .107         -.025         -.067
nur               .189         -.090          .148
chg              -.145         -.079           ?
end               .205          .318**        .076
het               .018         -.218         -.094
agg              -.224          .123?        -.059
cons             -.034          .190          .296*
Mehrabian         .134          .234

*p < .05.
**p < .01.
? = entry wholly or partly illegible in the source.


When achievement measures are related to the subjects' demographic
and classification data, only four out of 28 pairs of relationships
emerged with low but significant (p < .05) correlation coefficients. (See
Table 2.)


TABLE 2
Relationship to Demographic Data

                   Demographic or
Achievement        Classification               r
EPPS ach           grade point average          ?
Hermans            grade point average          ?
TAT n-Ach          ?                          .274
TAT n-Ach          socio-economic group         ?

? = entry illegible in the source.


Conclusion


In general, although a few small correlations were found in this
investigation, only one such relationship occurred between the
achievement tests themselves. Thus, the evidence presented continues to
support the general thrust of past findings, namely that the various
achievement motivation measures are in fact measuring dissimilar
constructs. Furthermore, the new paper-and-pencil tests, the Hermans
and the Mehrabian, are apparently not valid substitutes for each other
or to any large extent replacements for other achievement
measurement techniques.


PREDICTION OF PERSISTENCE AND PERFORMANCE WITH
THE HERMANS PRESTATIC MOTIVATION TEST' 


J. OGDEN HAMILTON* 
Indiana University 


Hermans’ Prestatic Motivation Test, a questionnaire measure of
achievement motivation, is easier to administer and to score than are
its projective counterparts, the Thematic Apperception Test and the
French Test of Insight; and it need not be administered under
controlled conditions. In two independent studies of its predictive
validity, Hermans’ measure was found to be positively related to
persistence and to performance in academic examinations, both when
the measure was used alone and when it was combined with the
Mandler-Sarason Test Anxiety Questionnaire as a measure of
resultant motivation. Moreover, although the French Test of Insight was
found to be related to persistence as in earlier research, it was not
related to Hermans’ measure. It is concluded that Hermans’
questionnaire taps a psychological characteristic that is manifest in
achievement directed behavior, but that this characteristic is
something other than the achievement motive of McClelland and his
colleagues.


THE purpose of this investigation was to examine the degree of
relationship between achievement motive measured by Hermans'
Prestatic Motivation Test (PMT) and measures of persistence and
performance in academic tasks. The measure, which consists of 29
Guttman-scaled items, amounts to a self-report measure of attitudes and
behaviors previously shown to be related to the achievement motive. It
was described in detail by Hermans (1970). Its principal strength is
logistical: As a questionnaire, it does not require the controlled condi- 
tions or the scoring expertise needed to use the best known projective 
measures of the achievement motive—the Thematic Apperception 
Test (TAT) (McClelland, Atkinson, Clark, and Lowell, 1953) and the 
French Test of Insight (FTI) (French, 1958). Because this flexibility is 
valuable, even indispensable in field research, the PMT seemed to 
merit validation beyond that initially reported by Hermans. This study 
was undertaken to ascertain whether the PMT predicts the same sorts 
of behavior as do the projective measures. If it does, then one can be 
more confident than previously in designing and interpreting future 
research with the measure on the basis of the wealth of knowledge 
about the achievement motive accumulated by McClelland and his 
followers. 

To maintain consistency with the motivation literature, the PMT 
was investigated both alone and in combination with the Mandler- 
Sarason Test Anxiety Questionnaire (TAQ) (Mandler and Sarason, 
1952), a measure of fear of failure. The TAQ often has been used with
each of the common projective measures to tap a joint construct,
resultant motivation. In theory, the more the achievement motive of
an individual exceeds his fear of failure, the higher his resultant
motivation and the more likely he is to engage in achievement seeking
behavior. Atkinson and Litwin (1960) have demonstrated that a
measure of resultant motivation can be a more accurate predictor of
behavior than is achievement motive alone. In the present studies, resultant
motivation was measured by subtracting the rank on the TAQ from 
the rank on the measure of achievement motivation, a high rank 
always indicating a high score on a measure. 
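
As a rough illustration of the rank-difference index just described, the following Python sketch ranks each measure and subtracts the TAQ rank from the achievement rank. The function names are hypothetical, and giving tied scores the mean of their ranks is an assumption, since the text does not say how ties were ranked.

```python
def ranks(scores):
    """Rank scores so that a high rank indicates a high score.

    Ranks start at 1 for the lowest score. Tied scores share the
    mean of the ranks they occupy (an assumption, not stated in
    the text).
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    result = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of scores tied with position i.
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def resultant_motivation(achievement_scores, taq_scores):
    """Rank on the achievement measure minus rank on the TAQ."""
    ach_ranks = ranks(achievement_scores)
    taq_ranks = ranks(taq_scores)
    return [a - t for a, t in zip(ach_ranks, taq_ranks)]
```

A subject ranked high on the achievement measure and low on the TAQ receives a large positive value, and the reverse pattern gives a large negative value.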


Study 1: Persistence 


As an accepted behavioral manifestation of the achievement motive,
persistence has been related to both common projective measures of
the motive: the TAT by Feather (1961) and the FTI by Atkinson and
Litwin (1960). Since the present study was intended simply to
determine whether the PMT could be substituted for the projective
measures used in earlier studies, the two-fold hypothesis offered was the
same one as in these earlier studies: namely, that persistence in a
difficult task would be positively related to the achievement motive
measured by the PMT as well as to resultant motivation measured by
the PMT and the TAQ.


Method


The 41 subjects were male undergraduates in a behavioral science 
course. All the data were collected in connection with the pedagogical 
requirements of the course. To permit direct comparison of the PMT 
with one of the standard projective measures, the FTI was adminis- 
tered in addition to the PMT and the TAQ. The FTI was scored by 
two experienced specialists who achieved satisfactory reliability (r =
.88).

The PMT consisted of the 29 items used by Hermans (1970). The
measure was scored as described by Hermans: i.e., those above the
median on each item received a score of one for that item, and those
below, a zero. Contrary to Hermans’ procedure, those scoring at the
median were given a fractional score derived from the proportional
allocation of the median rather than an arbitrary zero or one:


\[
\text{Fractional score} = 1 - \frac{n/2 - \text{no. below median}}{\text{no. tied at median}} \tag{1}
\]


This procedure is superior to Hermans’ scoring, for it is more sensitive
to the assumption of a theoretically continuous distribution of scores
on each item. In effect, the modification calculates the likelihood that
any given observation tied at the median in fact lies above the true
median.
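
The fractional score of Equation 1 can be sketched in Python as follows. The function names are hypothetical, and taking the item median with `statistics.median` (which averages the two middle values when n is even) is an assumption about a detail the text leaves open.

```python
from statistics import median


def fractional_score(responses):
    """Equation 1: score given to every response tied at the median.

    Returns None when no response equals the median, in which case
    no fractional score is needed.
    """
    n = len(responses)
    med = median(responses)
    below = sum(1 for r in responses if r < med)
    tied = sum(1 for r in responses if r == med)
    if tied == 0:
        return None
    return 1 - (n / 2 - below) / tied


def score_item(responses):
    """One for responses above the median, zero below, Equation 1 at it."""
    med = median(responses)
    frac = fractional_score(responses)
    scores = []
    for r in responses:
        if r > med:
            scores.append(1.0)
        elif r < med:
            scores.append(0.0)
        else:
            scores.append(frac)
    return scores
```

For the six responses 1, 2, 2, 2, 3, 3, one response lies below the median of 2 and three are tied at it, so each tied response scores 1 - (3 - 1)/3 = 1/3. The item scores then sum to n/2 = 3, which is what proportional allocation around the median implies.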

Persistence was measured by the length of time spent in the course
final examination, with no time limit. This is the measure that was
used by Atkinson and Litwin (1960), the assumption being that the
higher the achievement motive of a student, the longer he spends
attempting to do well on the examination. When a student finished the
final examination and left the room, the experimenter noted the time
on a card and asked the student to enter his social security number on
it.

Because a student's verbal ability reasonably might be related to the
length of time he spends on an examination, scores on the Verbal
section of the Scholastic Aptitude Test were held constant statistically
during the data analysis.
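
The text does not say how verbal ability was held constant. One common way, sketched here purely as an assumption, is a first-order partial correlation, which removes the linear association of a third variable from a correlation between two others. The function names are illustrative.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_r(x, y, z):
    """Correlation of x and y with z held constant.

    Here x could be a motivation score, y the time spent in the
    examination, and z the verbal aptitude score.
    """
    rxy = pearson_r(x, y)
    rxz = pearson_r(x, z)
    ryz = pearson_r(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))
```

A first-order partial correlation has N - 3 degrees of freedom, which would give 38 for the 41 subjects of this study.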


Results


As predicted, resultant motivation based on the PMT was
significantly related to persistence (r = .28, df = 38). The
relationship between persistence and the PMT alone was not quite so strong
as that between persistence and the measure of resultant motivation
(r = .21, df = 38, p < .10). The relationships between persistence and
both the FTI and the corresponding measure of resultant motivation
also were positive, but neither achieved the chosen level of
significance. The PMT and the FTI were virtually uncorrelated (r = .07).


Reanalysis and Results 


The lack of a significant relationship between the FTI and the time 
spent in the examination prompted investigation of the validity of the 
scores on the FTI. The original Atkinson and Litwin study used a less
conservative analysis than the one just described, in that it involved 
only subjects with extreme motive strengths. Specifically, subjects scor- 
ing both above the median in achievement motive and below the 
median in fear of failure (i.e., high resultant motivation) were com- 
pared to those scoring both below the median in achievement motive 
and above the median in fear of failure (i.e., low resultant motivation). 
Subjects scoring above the median on both measures or below on both 
were eliminated. Reanalysis of the data of the present study using this 
procedure resulted in a precise replication of the original study and put 
the FTI onto its home ground, so to speak. 

Using this procedure, it was found that both the PMT and the FTI
showed the predicted relationship to persistence. For the PMT, the 10
subjects classified high in resultant motivation spent an average of 99
minutes in the examination, compared to 76 minutes spent by the 10

Study II: Performance 


Performance as well as persistence is an accepted behavioral
manifestation of the achievement motive [...] Litwin (1960) have
demonstrated [...] study was the same one as in these earlier
studies: namely, that performance in an achievement task would be
positively related to the achievement motive measured by the PMT as
well as to resultant motivation measured by the PMT and the TAQ.


Method


The 24 subjects were students in an undergraduate behavioral sci- 
ence course. The specific class was chosen because of the instructor’s 
grading policy. Prior to taking an examination students were given a 
list of possible questions from which the questions used on the exam- 
ination were selected. Thus, the course grade for a student depended 
on the effort he was willing to devote to preparing answers to all the 
possible questions, and intelligence and verbal ability were less impor- 
tant than often would be the case. The course final grade was the 
measure of academic achievement. 

The course instructor administered the PMT and the TAQ in con- 
nection with the pedagogical requirements of the course. He was not 
aware that the score from these tests would be used for research; 
moreover, he never saw the test scores. No projective measure was 
administered. To control for verbal ability, the Verbal scores on the
Scholastic Aptitude Test were held constant statistically during the 
analysis. 


Results 


As predicted, the score on the PMT was significantly related to 
performance (r = .54, df = 21, p < .005). The relationship between
performance and resultant motivation was not quite so strong as that
between the PMT alone and performance (r = .29, df = 21, p < .10).
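
One conventional way to judge a correlation against its degrees of freedom is to convert it to a t statistic with t = r * sqrt(df / (1 - r^2)). The short Python sketch below shows that conversion. It is an illustration only, with a hypothetical function name, and the text does not state which significance test the author actually applied.

```python
from math import sqrt


def t_from_r(r, df):
    """t statistic for a correlation r with df degrees of freedom."""
    return r * sqrt(df / (1 - r * r))


# The two correlations reported for this study, each with df = 21.
t_pmt_alone = t_from_r(0.54, 21)
t_resultant = t_from_r(0.29, 21)
```

Each value can then be compared with a tabled critical value of t for 21 degrees of freedom.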


Discussion 


These two studies suggest that the PMT indeed measures a psycho- 
logical characteristic that manifests itself in achievement-seeking be- 
havior. At this point, however, it is not clear whether the correspond- 
ing measure of resultant motivation is a more accurate predictor of 
behavior than is the PMT alone, for the data in these studies did not 
permit definitive comparison of the two measures. A clear determina-
tion of this issue requires further research.

On the basis of the reanalysis of the persistence data, in which the
procedure of the original Atkinson and Litwin study of achieve-
ment motive and persistence was used, it seems reasonable to conclude
that the FTI was validly administered and scored, and that it is
related to persistence in the same way as it was in the Atkinson and
Litwin study. Since the PMT did show nearly the same relationship
with the same kind of behavior of the same sample as did the FTI, it is
reasonable to believe that it, too, would be a valid measure of a
predisposition to achieve. Therefore the very low correlation between
the two measures can be interpreted with some confidence as in- 
dicating that although each measure taps a psychological character- 
istic that is manifest in achievement-seeking behavior, the two meas- 
ures do not represent the same characteristic. 

In sum, it is concluded that the PMT promises to be a valuable tool 
for studying achievement-related behavior, especially when use of the 
standard projective measures is impractical. However, it must be real- 
ized that since the PMT seems not to measure the achievement motive 
of McClelland and his colleagues, one should not make indiscriminate 
use of those aspects of the achievement motive literature that have not 
been explicitly studied with the PMT. The present studies have at once 
shown the promise of the measure and also the critical importance of 
further step-by-step establishment of its construct validity. 
A PRELIMINARY VALIDATION OF AN INSTRUMENT TO
MEASURE THE DEGREE OF COUNSELOR RESTRICTIVE- 
NONRESTRICTIVE COGNITIVE ORIENTATION 


THOMAS A. SEAY AND E. TERRILL RILEY
Kutztown State College 


The present study was designed to provide a source of empirical 
validity for an instrument which measures the degree of cognitive 
functioning of a counselor along a restrictive-nonrestrictive dimen-
sion. The restrictive-nonrestrictive dimension refers to a holistic ori-
entation as a mode for experiencing life by the receptivity toward the
processing of and responding to sources of internal and external
stimuli. To establish validity for the instrument, one hundred eleven
counselor trainees in different phases of a training program designed 
to produce open and humanistic counselors were compared on the 
Counselor R Scale. The Rokeach Dogmatism Scale was included for 
analysis, since it was thought to be a component of the restrictive- 
nonrestrictive dimension. In the use of a 2 x 4 analysis of variance 
design for unequal n’s, the study provided data supporting the hy- 
pothesis that counselors in different phases of their training would 
differ in their scores on the Counselor R Scale. As a trainee pro-
gresses through a humanistically oriented training program, he or she
can be expected to move from the restrictive to the nonrestrictive
end of the measured dimension. Conclusions, implications, and
future research potential were described. 


IN a comprehensive review of the theoretical and research literature
on the characteristics of effective counselors, Shertzer and Stone
(1968) concluded that “at the present time, the counseling profession
is unable to demonstrate consistently that a single trait or pattern of
traits distinguishes an individual who is or will be a 'good counselor'”
(pp. 170-171). In a similar review, Brammer and Shostrom (1968)
reached the same conclusions. Thus, although numerous character-
istics have been identified, little consistency exists in the empirical
conclusions to justify their use in the selection or training of coun-
selors. However, a re-examination of the research literature indicates
that such pronouncements may be premature.

If the empirical findings are reordered using a different perspective, 
a single domain or dimension of traits emerges that could be character-
ized as a holistic orientation toward cognitively arranging and re-
arranging one's processed internal and environmental stimuli. Such a
dimension amplifies what the cognitive and affective receptivity of a 
counselor to his encounter with life is and indicates how others per- 
ceive and respond to that encounter. The counselor's receptivity is 
thought, at this point, to originate from a developmental, phys- 
iological cognitive processing system. 

If a pattern of characteristics emerges from the literature which 
allows conceptualization by means of a single dimension, then it be- 
comes necessary to devise an instrument to assess the parsimony of 
behaviors coterminous with the conceptualization. The Counselor R 
Scale is an instrument which is intended to measure a restrictive- 
nonrestrictive dimension of counselor functioning. The dimension re- 
fers to the counselor's cognitive set or orientation for perceiving and 
processing organismic and environmental stimuli. The cognitive set 
leads to behaviors which indicate the degree to which the counselor is 
open and receptive toward people, things, and events that enter the 
counselor's frame of reference. 
The cognitively nonrestrictive counselor as compared with one who
is a restrictive counselor will tend to be more open and flexible in
processing incoming stimulus events and, thereby, not only will re-
main open in his receptivity but, in addition, will actually expand his
cognitive substructures. This cognitive activity, in turn, influences his
behavior toward a nonrestrictive or appropriately restrictive mode of 
behaving. The cognitively restrictive counselor will tend to be just the 
reverse. In both instances, the behavioral mode will be reflected in the 
counselor's counseling orientation, approach, relationship, and use of 
selected counseling skills. 

The present study sought to validate the Counselor R Scale as a
measure of the dimension described. Because of the theoretical nature
of the dimension, it should reflect changes in counselors-in-training as
a result of entering a graduate program which emphasizes self and
professional development toward a humanistic, open approach to
counseling. Consequently, it was hypothesized that the R scale would
reflect entry and progression through such a program. Of secondary
interest was a potential male-female difference on the R scale.

An additional emphasis in the validation of the R scale was its 
correlation with the Rokeach Dogmatism Scale. Previous theorizing 
and research by Rokeach (1960) indicated that dogmatism should be a
component in the present dimension. Based on the previously stated
expectation, three Dogmatism items were found to be consistent with
the present dimension and were included in the final construction of
the R scale. Thus, because of expectations from theory and research
and because of the inclusion of Dogmatism items, there should be a
small positive correlation between the two scales.


Method 


Subjects 


The subjects for the present experiment were 111 graduate students
enrolled in various stages of a counselor education program at a small
eastern college. The stages of program participation and, thus, the
identification of experimental groups corresponded to four divisions
derived from the number of graduate credit hours completed in the
program. The four subgroups were designated as follows: (a) Admitted
Students, accepted to the program but not enrolled in courses; (b)
Beginning Students, with three to 12 credit hours earned in “core”
courses; (c) Mid-Way Students, with 15 to 24 credit hours completed;
and (d) Finishing Students, with 27 to 39 credit hours accumulated.
The subjects were heterogeneous in composition as might be expected
in a counselor preparation program. Heterogeneity was maintained
within each of the four groups and across those subjects who also had
completed the Dogmatism instrument (n = 71).


Instrumentation 


Counselor R Scale. Since a theoretical description was presented
previously, only the essential characteristics of the scale are discussed.
Construction of the R scale followed the recommendations of Nun-
nally (1967) for developing an index of internal test validity from
the unidimensionality of the construct being measured. Coefficient
alpha for the R scale produced an internal consistency index of .84 (n
= 107), more than sufficient to suggest the existence of a relatively
homogeneous measure of the construct or complex of [illegible].
Respective test-retest estimates of reliability of .84 (n = 64) and .76 (n
= 60) were obtained over a two-week period and, subsequently, over a
one to four month period. The R scale uses a Likert-type [illegible]
continuum in which high scores are indicative of restrictiveness and
low scores reflect nonrestrictiveness. The mean and standard deviation
for the norm group were found to be 124.62 and 23.52, respectively.
The skewness and kurtosis values of .15 and −.53, respectively, sug-
gested the presence of a relatively normal distribution of scores.

Rokeach Dogmatism Scale. The Rokeach (1960) scale is intended to 
measure the belief system of an individual. Because of its higher 
reported reliability (r = .91), Version D (66 items) rather than another
form was used in the present study. 


Description of the Program 


The present validity study rests on the assumption that the coun- 
selor preparation program would enable students to move toward an 
open, humanistic life style orientation in counseling. To achieve that 
goal, the program was composed of two components: (a) self-devel- 
opment through self-awareness; and (b) professional skills devel- 
opment through competency-based preparation. 

The self-development component was sought through formal and 
informal self-analysis and through systematic departmental faculty 
analysis. The primary purpose of the assessment procedures was to
identify characteristics associated with effective and ineffective coun- 
seling which would enable the students by self-direction and faculty 
assistance to remediate weaknesses and to build on strengths. 

Program components for professional skills development included 
both didactic and experiential learning with emphasis on development 
of a knowledge base, personal relationships, verbal and nonverbal
skills, and total counseling strategies. For purposes of both self-devel-
opment and professional development, much emphasis was placed on
tape analysis during all phases of the program of study. The ultimate
goal was the fully functioning counselor who would be cognitively
aware of, receptive toward, and adaptive to a state of experiencing and
who would have the ability to develop that state of experiencing as
a tool for effective counseling. The departmental members, all of
whom have been recipients of doctorates from major universities,
professed a humanistic orientation, although different schools of
thought have been represented.


Research Design 


The data for the preliminary validity study were analyzed through 
using a 2 × 4 analysis of variance design for unequal n's (Winer, 1962).
Selected relationships between the means were analyzed by the Scheffé
method of using orthogonal contrast coefficients for a priori data
analysis (Winer, 1962, pp. 88-89). The correlation coefficient between
the Dogmatism and the R scale was the Pearson r (Nunnally, 1967).


TABLE 1
Mean Raw Scores for Female and Male Students in Four
Positions within a Counselor Education Program

          Subgroup 1   Subgroup 2   Subgroup 3   Subgroup 4
          Admitted     Beginning    Mid-Way      Finishing
          Students     Students     Students     Students     Mean

Female     136.15       119.75       116.25       109.37     121.54
Male       145.16       122.85       141.37       109.57     131.31
Mean       141.38       121.08       128.81       109.46


Results


The analysis of variance summarized in Table 2 reveals that there
were significant differences among the means of the eight subsamples
formed by the inclusion of counselor sex with each of four positions of
participation in a counselor education program, F (7, 103) = 5.99, p <
.001. On the other hand, the differences between means associated
with sex of the subjects (Factor A) reached significance at slightly less
than the .05 level, but not at the .01 level, F (1, 103) = 4.09, p < .05.
Differences between means as a function of program position (Factor
B) attained significance considerably beyond the .001 level, F (3, 103)
= 10.19, p < .001. However, the interaction variance between Factors
A and B failed to reach significance, F (3, 103) = 1.47, p > .05.
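
Each F ratio above is the mean square for the effect divided by the within-subsamples mean square. As a rough check, the sketch below recomputes them from the sums of squares printed in Table 2 and the degrees of freedom given in the text. The function names are hypothetical, and small differences in the last digit are expected because the printed sums of squares are themselves rounded.

```python
# Sums of squares and degrees of freedom as reported for the R Scale analysis.
ANOVA_ROWS = {
    "Between all subsamples": (15996.21, 7),
    "A (Sex)": (1564.10, 1),
    "B (Position)": (11658.69, 3),
    "A x B (Interaction)": (1690.12, 3),
}
SS_WITHIN, DF_WITHIN = 39254.97, 103


def mean_square(ss: float, df: int) -> float:
    """Mean square = sum of squares divided by its degrees of freedom."""
    return ss / df


def f_ratio(ss_effect: float, df_effect: int) -> float:
    """F = effect mean square over the within-subsamples mean square."""
    return mean_square(ss_effect, df_effect) / mean_square(SS_WITHIN, DF_WITHIN)


for source, (ss, df) in ANOVA_ROWS.items():
    print(f"{source}: MS = {mean_square(ss, df):.2f}, "
          f"F({df}, {DF_WITHIN}) = {f_ratio(ss, df):.2f}")
```

The recomputed ratios agree with the reported values of 5.99, 4.09, 10.19, and 1.47 to within about .02.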

Table 3 reports the a priori analysis of selected differences between
the mean raw scores for the four program positions. The a priori
assignment of orthogonal contrast coefficients reveals that the com-
parisons and results were as follows:

1. The difference between the mean of the subgroup of Finishing
Students and the composite mean of the remaining three sub-
groups of students in different positions was statistically signifi-
cant, F (1, 103) = 13.62, p < .001.

TABLE 2
Analysis of Variance for Counselor R Scale Raw Scores

Source of Variance            SS        df       MS         F         p
Between all Subsamples     15,996.21     7    2,285.17     5.99**   < .001
A (Sex)                     1,564.10     1    1,564.10     4.09*    < .05
B (Position)               11,658.69     3    3,886.23    10.19**   < .001
A × B (Interaction)         1,690.12     3      563.37     1.47     > .05
Within all Subsamples      39,254.97   103      381.11

* Significant at p < .05: F.05 (1, 103) = 3.9[illegible].
** Significant at p < .01: F.01 (7, 103) = 2.7[illegible]; F.01 (3, 103) = 3.95.
TABLE 3
Orthogonal Contrasts Among Means in Relation to Four Positions
(Factor B) in the Counselor Education Program

Designated Contrast                     MS         F         p
ψ1 (μ4 − (μ1 + μ2 + μ3)/3 = 0)       5190.92    13.62**    .0003
ψ2 (μ1 − (μ2 + μ3)/2 = 0)            5098.82    13.38**    .0004
ψ3 (μ2 − μ3 = 0)                      719.88     1.89      .1700
Within variance                        381.11

** p < .001; F.01 (1, 103) = 6.85.


2. The difference between the mean of the subgroup of Admitted
Students and the combined mean of the two subgroups of Begin-
ning Students and Mid-Way Students was statistically significant,
F (1, 103) = 13.38, p < .001.

3. The simple difference between the mean of the two subgroups of 
Beginning Students and Mid-Way Students was not statistically 
significant, F (1, 103) = 1.89, p > .05. 

The Rokeach Dogmatism Scale correlated .31 with the Counselor R
Scale. The percentage of variance common to both instruments was
9.6% (r² = .096). Thus, 90.4% of the variance was unaccounted for or
unexplained.


Discussion and Conclusions 


[...] more restrictive and, thus, less open and flexible in their mode of
behaving as measured by the R scale than were members of the remain-
ing three subgroups. The Beginning and Mid-Way subgroups were
found to be similar in [...]. This last finding could [...]
high score of the males in the Mid-Way subgroup. The less restrictive
subgroup was the one composed of those subjects who were com-
pleting their training. In addition, if the two middle subgroups are
combined, as indicat[...] linear trend from [...]
least restrictive. Thus, the Finishing Students who would be expected to
demonstrate the more cognitively flexible processing system, as well as 
the more open, receptive attitude toward whatever they had been 
processing, did score on the average toward the nonrestrictive end of 
the dimension. 

Important in relation to the theoretical meaning assigned to the R 
scale was the lack of a highly significant male-female difference on the 
restrictive-nonrestrictive dimension. Theoretically, it might be possible 
to infer that the male and female counselors studied tended to struc- 
ture their perceptions in cognitively similar ways. In terms of utility, 
the R scale appears to be almost equally viable for male and female 
counselors. 

Only 9.6% common variance was shared by the Dogmatism Scale 
and the R scale. Thus, although each scale appeared to contain a small 
component of the other, each was a unique measure of a different 
construct. The finding substantiated expectations. 

Since the findings of the present study support the R scale as a
measure of counselor behavioral restriction and by inference as an 
indicator of cognitive restriction, it now becomes possible to link the 
restrictiveness-nonrestrictiveness dimension to other behaviors in 
counseling and in other social relationships. In addition, the dimen- 
sion should be explored in terms of selection, training, and selected 
counselor and client characteristics which determine effectiveness. It is 
anticipated that the nonrestrictive counselor would be the more effec- 
tive counselor and that during the counseling process the cognitively 
nonrestrictive counselor is adaptive enough to be both nonrestrictive
and appropriately restrictive. The parameters and implications of this 
concept must be examined. Ultimately, the R scale should be con- 
nected to the way an individual cognitively structures his internal 
world. Finally, the viability of the instrument for different populations 
such as teachers and ministers should be investigated.

In conclusion, the Counselor R Scale can be said to have face,
content (unidimensionality), and empirical validity. It now becomes
possible to attempt to determine the limits of its usefulness as a meas-
ure of counselor characteristics and, at the same time, expand the
empirical validity as an instrument which measures cognitive function-
ing.
HIGH SCHOOL TYPE, SEX, AND SOCIO-ECONOMIC
FACTORS AS PREDICTORS OF THE ACADEMIC 
ACHIEVEMENT OF UNIVERSITY STUDENTS 


JOHN F. McDONALD 
University of Illinois at Chicago Circle 


MICHAEL S. McPHERSON 
Williams College 


Grade point average was predicted for a sample of 152 students in 
Principles of Economics classes at the University of Illinois at Chi- 
cago Circle. It was shown that knowledge of high school type, sex, 
number of credit hours taken, and perhaps dollar value of scholar- 
ships and number of hours of outside work could significantly in- 
crease the ability to predict grades beyond that accomplished 
through using rank in high school class and American College Test- 
ing Program (ACT) Composite Score. 


IN recent years many studies have been undertaken with the purpose 
of finding variables which increase the predictability of college grades 
beyond that accomplished by using a measure of high school perform-
ance and a score on a standardized scholastic aptitude or achievement
test. For example, measures of high school quality have been devel- 
oped by Bloom and Peters (1961) and Loeb and Mueller (1970). Tests 
to measure attitudes and study habits have been developed by Holtz- 
man, Brown, and Farquhar (1954), and socio-economic variables have 
been added by Barger and Hall (1965). 

The purpose of this study was to evaluate the predictive usefulness
of several variables which measure some aspects of high school qual-
ity, motivation, study time, and socio-economic status. The effective-
ness of these variables in forecasting academic achievement is tested in
a simple linear regression framework.
The data for the study were obtained by administering a question-
naire to a sample of students in the Principles of Economics classes in
1973 at the University of Illinois at Chicago Circle (UICC). Table 1
lists the variables along with their means and standard deviations. All 
data pertain to the academic quarter just prior to the time at which the 
questionnaire was administered. The sample consisted of unmarried 
white students, except for four married and two black students. Dele- 
tion of these individuals from the statistical analysis to be presented 
does not alter the conclusions. Students on the GI Bill were excluded 
because of their unusually high ages and levels of outside financial 
support. After results with a smaller sample had been obtained, the 
sample was expanded to its present size of 152 students. 

A brief discussion of the hypotheses associated with each of the 
important additional variables follows. Type of high school attended 
(central city or suburban and public or parochial) was used as a simple 
proxy measure of high school quality. Most students at UICC have 
worked for compensation. It was hypothesized that the greater the
number of hours of outside work the lower would be earned grade 
point average (GPA) because study time per credit hour would be 
reduced and because students who worked more might have a weaker 
motivation for success in college. It was also hypothesized that scholar- 
ship support should increase grades because students who obtained 
scholarships exhibited some motivation for success in college. Finally,
it was hypothesized that credit hours taken might be positively or
negatively associated with GPA because students who took more
credit hours might show more motivation but also probably studied
less per credit hour. 

Grade point average (GPA) for the previous academic quarter was
measured on a scale in which A = 500, B = 400, C = 300, D = 200,
and E = 100.


Results 


The correlation matrix is presented in Table 1, and the multiple
regression results are shown in Table 2. The row in Table 2 labeled
regression No. 1 shows the result of regressing GPA on the American
College Testing Program (ACT) Composite Score and percentile rank
in high school class. The multiple correlation coefficient for this regres-
sion was .295, somewhat lower than the .50 found by Loeb and Muel-
ler (1970) in a similar regression for a sample of UICC students. In
regression analysis No. 2 a set of dummy variables indicating type of


TABLE 1
Variable Definitions, Means, Standard Deviations, and Correlations
(N = 152)

Variables
 1. Grade Point Average
 2. ACT Composite Score
 3. Percentile Rank HS Class
 4. City Parochial HS
 5. Suburban Public HS
 6. Suburban Parochial HS
 7. HS Non-Chicago Area (label partly illegible)
 8. Sex (1 = female, 0 = male)
 9. Scholarships ($/quarter)
10. Hours Worked Per Week
11. Credit Hours Taken
12. Non-City HS

[Means, standard deviations, and the correlation matrix are not recoverable from the source.]


932 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


TABLE 2
Regression Analysis of Grade Point Averages
(Prediction Variables Included are Numbered in Table 1)

Regression
Analysis   Regression Equations and Corresponding Multiple Correlation Coefficients

No. 1   GPA = 274.7 + 4.169 (Var. 2) + .496 (Var. 3)                 Multiple R = .295
             (7.74)   (2.93)***        (1.78)*

No. 2   GPA = 285.9 + 3.395 (Var. 2) + .691 (Var. 3) + 1.395 (Var. 4) + 22.03 (Var. 5)
             (7.74)   (2.31)**         (2.34)**        ([illegible])    (1.58)
              + 41.94 (Var. 6) + 18.58 (Var. 7)                      Multiple R = .365
                (2.43)**         (.41)

No. 3   GPA = 258.9 + 3.219 (Var. 2) + .783 (Var. 3) + 41.12 (Var. 12) + 38.35 (Var. 8)
             (7.47)   (2.42)**         (1.89)***       (3.91)***         (3.13)***
              + .044 (Var. 9) − .665 (Var. 10) + 1.924 (Var. 11)     Multiple R = .518
                (1.88)*         (1.59)           (2.31)**

Note.—The t statistics are in parentheses below the coefficient estimates. The designations *, **, and *** indicate significance at the
.10, .05, and .01 levels, respectively, for a two-tail test.


high school attended was added to regression analysis No. 1. The
results indicate that graduating from a suburban parochial school
added significantly to grades (.42 of a letter grade), and that the
suburban public school background might add to GPA. The increase
in explanatory power over regression No. 1 was highly significant (F =
118.5). It should also be noted that the coefficient of high school rank
was estimated with more precision in regression analysis No. 2 than in
regression analysis No. 1. In regression analysis No. 3 the socio-
economic variables were included and the high school type variable
was used as a dummy variable which indicated that the student gradu-
ated from a high school not in the city of Chicago. The results revealed
that female students did achieve significantly higher grades than male
students did (.38 of a letter grade), students who took more credit
hours made slightly higher grades than those who took fewer credit
hours, students who worked might receive slightly lower grades than
those who did not work, and that students with scholarship support
might earn higher grades than might those without such support. The
hypothesis that the relationship between GPA and hours of work was
linear could not be rejected (F = 1.90 with 1 and 134 degrees of
freedom). Additional variables were tested in regression analyses not
reported here. These variables, which included the student's age, year
in school, number of siblings, and the parents' income and educational
attainment, were all statistically nonsignificant.


Interpretation


Except for sex, none of the standard measures of socio-economic
status made a significant contribution to the prediction of college
success. This failure might stem from the fact that a biased sample of
college students with high status parents had been observed. Because 
UICC is an inexpensive state university, many parents of high socio- 
economic status may prefer to send their more highly motivated chil- 
dren to more expensive and higher-status colleges. In addition, female 
students at UICC may earn higher grades because they exhibit 
stronger motivation or because they are self-selected according to 
some ability characteristic which has not been measured. Further 
validation of the results with new samples is warranted. 
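
As an illustration of how regression No. 3 in Table 2 would be used for prediction, the sketch below applies its printed coefficients to one student. The class, the function name, and the example student are hypothetical, and the variable meanings follow the Table 1 numbering only as far as it can be read from the source. GPA is on the scale in which A = 500 and E = 100.

```python
from dataclasses import dataclass


@dataclass
class Student:
    """Predictors for regression No. 3 (field names are illustrative)."""
    act_composite: float        # Var. 2
    hs_percentile_rank: float   # Var. 3
    non_city_hs: int            # Var. 12, 1 if high school not in Chicago
    female: int                 # Var. 8, 1 = female, 0 = male
    scholarship_dollars: float  # Var. 9, dollars per quarter
    hours_worked: float         # Var. 10, hours per week
    credit_hours: float         # Var. 11


def predict_gpa(s: Student) -> float:
    """Apply the coefficients printed for regression No. 3."""
    return (258.9
            + 3.219 * s.act_composite
            + 0.783 * s.hs_percentile_rank
            + 41.12 * s.non_city_hs
            + 38.35 * s.female
            + 0.044 * s.scholarship_dollars
            - 0.665 * s.hours_worked
            + 1.924 * s.credit_hours)


# A made-up student, used only to show the arithmetic.
example = Student(22, 70, 1, 1, 0.0, 10, 12)
same_but_male = Student(22, 70, 1, 0, 0.0, 10, 12)

print(f"Predicted GPA: {predict_gpa(example):.1f}")
print(f"Female minus male: {predict_gpa(example) - predict_gpa(same_but_male):.2f}")
```

The difference between the two predictions is the sex coefficient itself, 38.35 points, which is the ".38 of a letter grade" mentioned in the Results.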
RELATIONSHIPS OF SELECTED NONACADEMIC AND
ACADEMIC VARIABLES TO THE GRADE POINT AVERAGE 
OF BLACK STUDENTS' 


SHIH-SUNG WEN AND ROSE E. MCCOY 
Jackson State University 


An analysis of correlations of the data mainly from 164 male and
202 female black undergraduate students indicated that (a) a
weighted set of measures of manifest needs (Edwards Personal Prefer-
ence Schedule) correlated significantly with the grade point average
(GPA) for the males (R = .53, df = 15/148, F = 3.84, p < .001) but
not for the females (R = .30, df = 15/186, F = 1.20, ns), (b) a
weighted composite of measures of personal problems (Mooney
Problem Check List) correlated significantly with the GPA for both
the males (R = .47, df = 11/152, F = 3.86, p < .001) and the females
(R = .36, df = 11/190, F = 2.48, p < .005), (c) manifest anxiety
(Taylor Manifest Anxiety Scale) correlated significantly with the
GPA for the male students only (r = −.22, df = 162, p < .005), and
(d) the scholastic aptitudes (American College Testing) correlated
significantly with the GPA for both male (R = .48, df = 5/108, F =
[illegible], p < .001) and female students (R = .50, df = 5/139, F = 9.34, p
< .01).


STUDIES have impressively demonstrated that certain variables are
associated with higher or lower academic achievement than would 
have been predicted from intelligence tests alone. There is considerable 
evidence that students of high intellectual ability sometimes fail or 
drop out of college; thus, a measure of intelligence is not a sufficient 
forecaster of college success. 

Behavior analysis of the college experience indicates that non-
academic variables are highly relevant to the quality and rate of aca-
demic performance. Quantitative indications of the learner's motiva-
tion, anxiety, personal problems, among others, have been widely
accepted as important determinants of college success.

¹ The study was supported by the Research and Publication Committee at Jackson
State University.

The purpose of this study was to ascertain for each of two samples 
of 164 male black and 202 female black students in a southern state 
university the degree to which selected nonacademic and academic 
characteristics of college students as indicated by self-report scales and 
standardized measures of scholastic ability were predictive of college 
success as revealed by grade point average (GPA) earned during the
first two quarters of the 1972-73 academic year. Specifically, for each 
sample the degree of correlation was sought between GPA and (a) 
each of the 15 measures of manifest needs within the Edwards Per- 
sonal Preference Schedule (EPPS) (Edwards, 1959); (b) each of the 11 
measures of personal problems within the Mooney Problem Check 
List (MPCL) (Mooney and Gordon, 1950); (c) the single measure of 
anxiety obtained by the Taylor Manifest Anxiety Scale (TMAS) (Tay-
lor, 1953); and (d) each of five measures of scholastic ability provided
by the American College Testing Program (ACT) including English, 
Mathematics, Social Studies, Natural Sciences, and Composite Score 
(1968-1971). In addition, multiple correlation coefficients were also 
determined between the GPA and each of the weighted composites of 
measures of the EPPS, MPCL, and ACT. 

The sample for the study was drawn from the student population of 
a predominantly black state university. Since less than 5% of the
student enrollment is nonblack, and since the state in which the univer- 
sity is located is largely rural, it could be assumed that the majority of 
the student population has come from relatively depressed environ- 
ments, and further, that achievement motivation, personal adjustment, 
and other relevant factors have been unfavorably shaped by the expe- 
riences in these environments. 

In a study of motives, Brazziel (1964) reported that both lower and 
middle class black females in the lower-south manifested high needs 
for achievement. However, Williams and Cole (1969) reported apathy 
and low morale toward academic achievement in black students. Atch- 
inson (1968) found a significant positive correlation between Manifest 
Anxiety Scale scores and grade point averages for black college soph-
omores. The correlation between the ACT assessment and college
grades was .59 among freshmen at a southern black state college
(Funches, 1965).


Method 


Subjects were 164 male and 202 female students in sections of
[course name illegible]. Their ages ranged from 18 to 23, with the average
about 19.

Over a period of two quarters of the academic year 1972-1973, the
EPPS, MPCL, and TMAS were administered to students during the regu-
lar class periods. The EPPS consists of 15 subscales: Achievement
(Ach), Deference (Def), Order (Ord), Exhibition (Exh), Autonomy 
(Aut), Affiliation (Aff), Intraception (Int), Succorance (Suc), Domi- 
nance (Dom), Abasement (Aba), Nurturance (Nur), Change (Chg), 
Endurance (End), Heterosexuality (Het), and Aggression (Agg). The
11 problem areas of the MPCL are: Health and Physical Development 
(HPD); Finance, Living Conditions, and Employment (FLE); Social 
and Recreational Activities (SRA); Social-Psychological Relations 
(SPR); Personal-Psychological Relations (PPR); Courtship, Sex, and
Marriage (CSM); Home and Family (HF); Morals and Religions 
(MR); Adjustment to School Work (ACW); The Future—Vocational 
and Educational (FVE); and Curriculum and Teaching Procedure 
(CTP). 

Subjects’ scores on the ACT and their cumulative GPA’s were 
obtained from college records. Since some students’ ACT scores were 
not available from the records, the N was reduced for the correlation
between the ACT and GPA. 

The product-moment correlations were calculated between the GPA 
and each of 15 measures of EPPS, each of 11 measures of MPCL, each 
of five ACT scores, and the single measure of TMAS. In addition, the 
multiple correlation analyses (Cooley and Lohnes, 1971) were con- 
ducted for the GPA and each weighted set of measures of EPPS, 
MPCL, and ACT, separately. 
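
The product-moment correlation named above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `pearson_r` and the six pairs of scores are hypothetical and are not data from the study.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical GPA and anxiety-scale scores for six students.
gpa = [2.1, 2.8, 3.0, 3.4, 2.5, 3.7]
tmas = [24, 18, 17, 12, 21, 9]
r = pearson_r(gpa, tmas)  # negative here: higher anxiety goes with lower GPA
```

In an analysis of the kind described, the same function would be applied once per predictor (each EPPS, MPCL, ACT, and TMAS score) against the GPA.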


Results 


Results of multiple correlation analyses and variables each of which 
correlated significantly (at or beyond the .05 level) with the GPA are 
presented in Table 1.

In males, the variables, each of which significantly correlated with 
the GPA, were 14 measures of EPPS, eight subscales within MPCL, 
five ACT scores, and the single TMAS scale (r = -.22). The multiple
correlations between the GPA and each weighted set of measures of 
EPPS, MPCL, and ACT were all significant at the .001 level. 

In females, one of the EPPS measures, two of the MPCL subscales, and
five ACT scores significantly correlated with the GPA. Thus, the
significant multiple correlations between the GPA and the weighted
sets of measures were limited to the MPCL and ACT only.

The results indicated that manifest needs, manifest anxiety, and
personal problems associated with academic achievement were
stronger for male black than for female black college students.


COMPARATIVE PREDICTION OF FIRST YEAR GRADUATE
AND PROFESSIONAL SCHOOL GRADES IN SIX FIELDS¹


LEONARD L. BAIRD 


Educational Testing Service 


The validity of predictors of academic performance in six post-
graduate fields was compared. The fields included three liberal arts
areas and three professional areas: arts and humanities, biological 
and physical science, social science, law, medicine, and business. The 
predictors included information about students’ backgrounds, self- 
conceptions, values, nonacademic achievements, and curricular pat- 
terns as well as admissions test scores and grades. In most fields, 
grades were predicted by academic ability and by prior achievement, 
self-confidence, and previous accomplishment in the field.
Background variables predicted grades only in law and arts and
humanities. The predictive power of admissions tests varied from
field to field.


THERE have been many studies of the prediction of academic
performance in graduate and professional school (e.g., Lannholm, 1968;
Cliff and Cliff, 1972). Although these studies have produced a great
deal of information, as a group they have been limited in three ways:
(1) they have concentrated on tests of academic ability as predictors;
(2) they have usually been limited to a sample of a single department
or school; and (3) they have not had comparison groups in other
fields. Based on a follow-up of a national sample of college seniors,
the present study (a) includes information about students' biographical
characteristics, self-conceptions, work values, nonacademic
achievements, and curricular patterns as well as admissions test scores
and grades as predictors; (b) extends over a wide variety of schools
and departments; and (c) compares the validity of predictors for
samples of college seniors in relation to criteria of academic
performance in six postgraduate fields. The six fields of postgraduate
study included graduate work in three liberal arts areas and in three
professional areas: arts and humanities, biological and physical
science, social science, law school, medical school, and (graduate)
business school. The purposes of the study were to compare the validity
of predictors of academic performance in various fields and to evaluate
the contribution of variables assessing students’ backgrounds,
educational histories, self-conceptions, and values in the prediction
of grades.

¹ This study was supported by the Association of American Medical
Colleges, the Graduate Record Examinations Board, and the Law School
Admissions Council.


Method 
Data sources 


The data for this study originated from a follow-up of a national 
survey of a sample of college seniors who replied to a questionnaire, 
the College Senior Survey, in the spring of 1971 (Baird, Hartnett, and 
Clark, 1973). Follow-up information regarding the activities of 7,734 
seniors in 94 colleges across the country was obtained in late spring of 
1972. A variety of data analyses indicated that the sample was
representative, except that minority students were slightly
underrepresented.

The College Senior Survey covered a great deal of biographical,
personal, attitudinal, and educational information about students.
Reports of their Admission Test for Graduate Study in Business
(ATGSB), Law School Admissions Test (LSAT), Graduate Record
Examinations (GRE), and Medical College Admission Test (MCAT)
scores were obtained. The follow-up questionnaire ascertained
students’ educational and vocational activities. The criteria used
here were self-reported grades in (a) graduate study in the arts or
humanities (N = 415); (b) graduate study in the social sciences
(N = 400); (c) graduate study in the biological and physical sciences
(N = 525); (d) medical school (N = 440); (e) law school (N = 450);
and (f) graduate business school (N = 310). The numbers in parentheses
indicate the number of cases in each field with complete data in both
the senior and follow-up files. Research summarized by the American
College Testing Program (1973) has shown self-reported grades to be
highly reliable and valid and highly intercorrelated with school
reported grades.


Methods 


By generating a missing data correlation matrix of the College
Senior Survey information, the writer was able to identify the
characteristics that were most strongly correlated with grades in each
postgraduate field. In relation to grades in each field as a criterion,
stepwise multiple regression analyses were employed to identify the
factors that were most strongly associated with grades in each area of
study.

Because the undergraduate grades were from such a wide variety of 
institutions their predictive power in the analyses was probably lim- 
ited. Similarly, since students in the follow-up sample in any particular 
area were attending a wide variety of institutions, the size of the 
correlations of any variable with graduate or professional school 
grades was probably considerably lower than the value would have 
been in many single schools. 


Results 


In Table 1 the zero-order correlation coefficients of each of the
predictor variables with grades earned in each of the six fields of 
graduate and professional study are set forth. In Table 2 the outcomes 
of stepwise multiple regression analyses are summarized for each field 
of postgraduate endeavor. Only those variables which yielded statist- 
ically significant contributions and increased the multiple correlation 
by at least .01 are listed in Table 2. 
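
The listing rule just described (keep a predictor only if it raises the multiple correlation by at least .01) can be illustrated with a forward selection that works directly from a correlation matrix. This is a minimal sketch, not the program used in the study: the function names, the three-predictor correlation matrix, and the validity coefficients are all hypothetical, and the significance test that the study also applied is omitted.

```python
import math

def solve(a, b):
    """Solve the linear system a @ x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(m[i][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular correlation matrix")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    x = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * x[k] for k in range(row + 1, n))
        x[row] = (m[row][n] - tail) / m[row][row]
    return x

def multiple_r(rxx, rxy, subset):
    """Multiple correlation of the criterion with the predictors in `subset`."""
    if not subset:
        return 0.0
    sub = [[rxx[i][j] for j in subset] for i in subset]
    validities = [rxy[i] for i in subset]
    betas = solve(sub, validities)  # standardized regression weights
    r_squared = sum(b * v for b, v in zip(betas, validities))
    return math.sqrt(max(r_squared, 0.0))

def forward_stepwise(rxx, rxy, min_gain=0.01):
    """Add the best remaining predictor until R rises by less than `min_gain`."""
    chosen, current = [], 0.0
    remaining = list(range(len(rxy)))
    while remaining:
        best = max(remaining, key=lambda j: multiple_r(rxx, rxy, chosen + [j]))
        new_r = multiple_r(rxx, rxy, chosen + [best])
        if new_r - current < min_gain:
            break
        chosen.append(best)
        remaining.remove(best)
        current = new_r
    return chosen, current

# Hypothetical intercorrelations among three predictors and their
# zero-order correlations with grades.
rxx = [[1.0, 0.5, 0.2],
       [0.5, 1.0, 0.1],
       [0.2, 0.1, 1.0]]
rxy = [0.30, 0.25, 0.02]
chosen, final_r = forward_stepwise(rxx, rxy)
```

With these made-up values the first two predictors enter and the third is dropped, because adding it would raise R by less than .01.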

Certain variables had higher zero-order correlations in some fields 
than others as the entries in Table 1 show. Sex and religious back- 
ground had little relation to grades in any field. Parental education 
was related only in arts and humanities and business; family income 
was related only in arts and humanities and law. Parental and peer 
encouragement of students’ plans for further study was unrelated in all 
fields. Consideration of graduate or professional school at an early age 
was most positively related to grades in arts and humanities and in the 
three professional areas. Grades in all courses constituted a more valid 
predictor than did grades in major field courses in every area except 
biological and physical science. Several variables reflecting self-con- 
fidence of students in their ability to handle academic work were 
related to grades in every area, most consistently in law and business. 
Work values generally had small relations to grades in most areas
except science, where interest in working with people was negatively
related to grades. Admission test scores were less efficient predictors
in every field except law than were grades. Tests predicted most
accurately in law and business, least so in medicine. Curricular
choices were unrelated to grades in most areas except for senior plans
to enter the field in medicine, and except for senior major in the
field in social science. Almost all nonacademic achievements, such as
being president of the student body, were unrelated to grades in any
area, with the exceptions of having an award in the field in science
and business and holding a scientific assistantship in medicine and
business.


TABLE 1
Comparative Predictive Correlates of First Year Graduate and
Professional School Grades

Columns (correlation with grades):
  Graduate Study in: Arts & Hum, Biol/Phys Sci, Social Sci
  Professional Study in: Law, Medicine, Business

Background Variables
  Sex (1 = Male, 2 = Female)
  Race (1 = Black, 2 = White)
  Parental level of education
  Family income
  Father's encouragement
  Mother's encouragement
  Friend's encouragement
  Raised in Jewish religion
  Age first thought of advanced study
College Grades
  All courses
  Major field courses
Self-Conception
  I would rank among the best in academic ability in my class in
    college
  I have the ability to complete the advanced work needed to become
    a doctor, lawyer, or university professor
  I think I would be able to get mostly A’s in a graduate or
    professional school
  Self-rating on writing ability
  Self-rating on scholarship
  Self-rating on scientific ability
  Self-rating on mathematical ability
Work Values
  More interest in people than things
  Desire to contribute to knowledge
Admissions Test Scores
  GRE-V
  GRE-M
  LSAT
  MCAT
  ATGSB
Career Choices
  Freshman vocational choice in field
  Senior major in field
  Senior plan to study in field next fall
  Level of degree aspiration
College Activities
  Won award in field
  Assistantship in science

The general pattern of results obtained in the zero-order correla- 
tions was also obtained in the stepwise multiple regression results 
shown in Table 2. In most fields, grades were predicted by academic 
ability and achievement, self-confidence, and previous
accomplishment in the field. In medicine, the MCAT did not enter as a
predictor. Background variables added to the level of prediction only
in law and arts and humanities.


TABLE 2
Stepwise Multiple Regression Results

Arts & Humanities
  College grades in all courses
  GRE-Mathematical score
  Parental level of education
  Age first thought of advanced study
  I have the ability to complete the work needed to become a
    doctor, lawyer or university professor
  Self-rating on scholarship
Biological & Physical Science
  College grades in major field courses
  GRE-Verbal score
  GRE-Mathematical score
  Won award in field
Social Science
  College grades in all courses
  College grades in major field courses
  Senior major in social science
  GRE-Mathematical score
Law
  LSAT scores
  College grades in all courses
  I think I would be able to get mostly A's in a graduate or
    professional school
  Family income
Medicine
  College grades in all courses
  Senior plans to study in field
  College grades in major field courses
  Self-rating on mathematical ability
  Held assistantship in science
Business
  Self-rating on scholarship
  College grades in all courses
  ATGSB scores
  I think I would be able to get mostly A's in a graduate or
    professional school
  Self-rating on writing ability
  Won award in field


Discussion

Although the requirements for success in graduate and professional
fields seem to have some common elements, each one has its unique
pattern. Academic aptitude and achievement were important in every
instance. Confidence of students in their ability was also important in 
every area, especially law and business. Students’ conceptions of their 
abilities had mostly logical relations to the demands of the field. 
Strictly biographical information made only a small contribution to 
prediction in most fields, as did work values. Academic success in 
some areas, such as science and social science seemed to be related 
most strongly to previous academic performance in the area, whereas 
in others, such as law, medicine, and business, students’ self-con- 
fidence plays a role, and in other areas such as arts and humanities and 
law, success was related to family background.


THE MODERATOR EFFECT OF UNDERGRADUATE
GRADE POINT AVERAGE ON THE PREDICTION OF
SUCCESS IN GRADUATE EDUCATION¹


ROBERT W. COVERT 
University of Virginia 


NORMAN M. CHANSKY
Temple University


Three hundred and six Masters of Education students at a large 
urban university were divided into six subgroups according to sex 
and to each of three levels of undergraduate grade point average. 
Correlation coefficients between graduate grade point average and 
each of three predictor variables, consisting of Graduate Record 
Examinations—Verbal score, Graduate Record Examination— 
Quantitative score, and undergraduate grade point average, were 
calculated for each of the six subgroups. Results showed differential 
predictability across the different subgroups. 


THE selection of candidates for graduate education programs for
many years has depended upon linear regressions. The predictors in
these equations have included undergraduate grade point average
(UGPA), as well as scores on the Miller Analogies Test (MAT) or
Graduate Record Examinations Aptitude Test (Verbal and Quantitative
scores: GREV and GREQ), whereas the criterion has generally been the
graduate grade point average (GGPA). The ability of these predictors
to differentiate between graduate students varies from institution to
institution. In general, however, most prediction studies of graduate
success have provided administrators with only one or two validity
coefficients for use in the selection of candidates. These
coefficients were then assumed to be equally valid for the total group.

It was the purpose of this research to explore the possibility of
differential predictability of candidates on the basis of different
levels of UGPA and sex.

¹ This research was sponsored by the Internal Research Center of
Temple University.
Copyright 1975 by Frederic Kuder


Method 
Subjects 


The sample of 306 subjects for this investigation included all stu- 
dents who had been accepted into the Master of Education Program at 
a large urban university from September, 1967 to September, 1968, 
and for whom GREV, GREQ, UGPA, and GGPA data were com- 
plete. These students had either completed or terminated their pro- 
grams at the time of this investigation. 


Procedure 


The total group was divided into thirds on the basis of standing on 
UGPA. The lowest third included undergraduates with UGPA's of 
less than 2.5; the moderate group contained students with UGPA's 
between 2.5 and 2.9; whereas the upper third consisted of persons with 
UGPA’s of greater than 2.9. Multiple correlation coefficients using 
UGPA, GREQ, and GREV as predictors were then separately calcu- 
lated for the total group as well as for each of the subgroups which 
were subsequently divided according to sex. 
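
The grouping step described above can be sketched as follows, using the cutoffs given in the text (below 2.5, 2.5 through 2.9, above 2.9). The record layout, field names, and the four example students are hypothetical and serve only to show how the six UGPA-by-sex subgroups would be formed before correlations are computed within each.

```python
from collections import defaultdict

def ugpa_band(ugpa):
    """Assign a UGPA to the low, moderate, or high band described in the text."""
    if ugpa < 2.5:
        return "low"
    if ugpa <= 2.9:
        return "moderate"
    return "high"

def split_by_band_and_sex(students):
    """Group student records into UGPA-by-sex subgroups."""
    groups = defaultdict(list)
    for student in students:
        key = (ugpa_band(student["ugpa"]), student["sex"])
        groups[key].append(student)
    return dict(groups)

# Hypothetical records (not data from the study).
students = [
    {"sex": "F", "ugpa": 2.3, "ggpa": 3.2},
    {"sex": "M", "ugpa": 2.7, "ggpa": 3.4},
    {"sex": "F", "ugpa": 3.1, "ggpa": 3.6},
    {"sex": "M", "ugpa": 2.4, "ggpa": 3.1},
]
subgroups = split_by_band_and_sex(students)
```

Each list in `subgroups` would then be passed to whatever correlation routine is in use, once per predictor.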


Results 
The means and standard deviations for the total group are included
in Table 1.

For the total group, the multiple correlation between the composite
of three predictors including UGPA, GREV, and GREQ and the
criterion was .29. The correlation between UGPA and GGPA alone
was .24.

TABLE 1
Means and Standard Deviations for Students Completing or Terminating
Masters Programs in Education (N = 306)

                      Standard
Variables    Mean     Deviation
GREV         533      89


Table 2 illustrates the differential predictability of the total group
divided according to thirds on the basis of standing on the UGPA and
by sex.

Table 2 shows that when predicting males and females simultaneously,
no significant relationship existed between any one of the predictors
and GGPA for students whose UGPA was below 2.9. On the
other hand, a significant bivariate correlation of .35 existed for those 
students with UGPA’s exceeding 2.9. The table also shows that in the 
low grade point average group, females could be predicted signifi- 
cantly (r = .33); in the moderate grade point range, males could be 
predicted significantly (r = .44); and in the high grade point range, 
only females could be predicted significantly (R = .45).
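
Squaring a correlation coefficient gives the proportion of criterion variance it accounts for, which puts the significant coefficients reported in this paragraph in perspective. The short sketch below does that arithmetic; the dictionary labels are illustrative names chosen here, and the coefficients are the ones quoted in the text.

```python
# Significant coefficients quoted in the text (r or R with GGPA).
reported = {
    "low UGPA, females (r)": 0.33,
    "moderate UGPA, males (r)": 0.44,
    "high UGPA, females (R)": 0.45,
    "high UGPA, total (r)": 0.35,
}

# Proportion of variance in GGPA accounted for by each coefficient.
variance_share = {label: coef ** 2 for label, coef in reported.items()}
largest_share = max(variance_share.values())  # about one-fifth of the variance
```

Even the largest coefficient leaves roughly four-fifths of the variance in GGPA unaccounted for.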


Discussion 


The findings show that differential predictability was achieved by 
dividing the students according to sex and UGPA. Important is the 
fact that those students in the lowest third of UGPA were the least 
predictable, whereas those in the highest third were the most
predictable. Furthermore, the data show that females could be
significantly predicted in both the highest and lowest thirds of UGPA,
and that males could only be significantly predicted within the
moderate range of UGPA. These findings should be viewed with caution
because of the negative skewness of the scores in the criterion
measure. This circumstance might mean that students of low ability and
low achievement, when given a chance to complete a master's degree
program, might achieve at as high a level as those with greater
ability. In looking at Table 2, one can see that although students
with high UGPA's had slightly higher average GGPA’s, all three groups
had GGPA's substantially above the B (3.00) level. One explanation of
these high GGPA's might be that little discrimination was made with
regard to the quality of performance to which grades had been assigned.


TABLE 2
Correlations, Means, and Standard Deviations for Masters of Education
Students by Grade Point Average and Sex

                              Mean                     Best
Subgroup                N     GGPA     SD      r       predictor
Low Grade Point Average (Less than 2.5)
  Female               46     3.33    .26     .33*     GREV
  Male                 64     3.25    .40     .09      GREV
  Total               ...     ...     ...     ...      ...
Moderate Grade Point Average (2.5-2.9)
  Female               66     3.38    .33     .20      UGPA
  Male                 31     3.34    .31     .44*     GREV
  Total               103     3.36    .32     .19      GRE...
High Grade Point Average (Greater than 2.9)
  Female              ...     ...     ...     ...      ...
  Male                 22     ...     ...     ...      GREQ
  Total                93     3.44    .35     .35      ...

Note. It should be noted that although multiple correlations were
calculated, only for females in the high UGPA subgroup was the
weighted composite of two variables significantly related to the
criterion; therefore, the coefficient for females in the high UGPA
subgroup represents the only multiple correlation in the table.

These results have implications for those selecting prospective grad- 
uate students in education. First, they suggest that elimination of 
candidates on the basis of one validity coefficient seems unjustified, 
since both sex and UGPA had a moderator effect. Second, and per- 
haps most important, is the fact that the use of the three predictors as 
the only selection device for candidates would be questionable since, at 
best, these predictors were accounting for no more than 20% of the 
variance in the criterion measure of success.


PREDICTIVE VALIDITY OF SVIB PHARMACIST SCALES¹


RICHARD W. JOHNSON AND KENNETH W. KIRK 
University of Wisconsin-Madison 


RICHARD A. OHVALL 
Ferris State College 


The predictive validity of the old and new men’s Pharmacist scales
and the new women’s Pharmacist scale on the Strong Vocational
Interest Blank was investigated for 279 male and 76 female pharmacy
students. Each of the three scales significantly differentiated between
graduates and nongraduates from the pharmacy program. Separate
sex norms appeared to be necessary for the two men’s scales, but not
for the new women’s scale. The women’s scale most accurately
identified the pharmacy majors and produced the highest correlations
with the criterion for both sexes.


THE Pharmacist Occupational scale for the Strong Vocational Interest
Blank for Men (SVIB-M) recently was revised by Campbell (1971).
A Pharmacist scale for the SVIB for Women (SVIB-W) also has been
constructed within the past year (Kirk, Johnson, and Ohvall, 1974).
The women's single scale primarily measures interest in scientific
activities, whereas the men’s two scales mainly reflect business
interests.

Abridged versions of the new men's and women's Pharmacist scales
appear on the new Strong-Campbell Interest Inventory (Campbell,
1974). No predictive validity studies for either one of the new men's
or women's Pharmacist scales have been reported in the literature.

The primary purpose of this study was to estimate the validity of the 
Pharmacist scales in predicting graduation from a school of pharmacy. 


¹ This study was supported in part by a grant from the Graduate School
Research Committee, University of Wisconsin-Madison.


Method


Subjects 


The first-year pharmacy students (college juniors) at the University 
of Wisconsin-Madison for three successive years (1968-1970) served 
as subjects. Approximately 90% (279 of 312 males; 76 of 83 females) of
the students in the three classes completed the SVIB-M sometime 
during their first year in the School of Pharmacy. 


Predictors 


The SVIB-M (Form T399) was administered to both male and
female students in order to obtain comparable data for all students. 
The original men's scale (Schwebel, 1951), the revised men's scale 
(Campbell, 1971), and a modified version of the new women's scale 
were used as predictors. The modified women's scale is based on 60 
items found on both the SVIB-M and SVIB-W that differentiated 
between the interests of 431 women pharmacists and women-in-gen- 
eral by at least 14 percentage points. The SVIB-M version of the 
Women's Pharmacist scale was highly correlated (r = .94) with the
SVIB-W version for an independent sample of 215 women pharma- 
cists tested by Kirk, Johnson, and Ohvall (1974). 


Criterion 


Graduation status at the end of the three year period that was
required to complete the pharmacy curriculum was used as the
criterion. One-sixth of the male students (48 of 279) and one-fifth of
the female students (14 of 76) were not graduated with their class.


Statistical analysis 
The scores on each of the three scales were analyzed by
means of two-way analysis of variance (sex X graduation status) for
unequal cells. Multiple correlation coefficients were computed to
determine whether a weighted composite of the scales could predict
graduation status more accurately than the individual scales.

* A list of the 60 items and the item weights may be obtained from the first author.


Results and Discussion

All three of the Pharmacist scales significantly differentiated (p <
.05) between graduates and nongraduates as shown in Table 1. The 
magnitude of the difference was approximately five standard score 
points for both the new men's and women's scales (see Table 2). 

To determine whether the relationship between the Pharmacist 
scales and graduation status was contaminated by an academic ability 
factor, the Pharmacist scales were correlated with the College Qual- 
ification Tests (Verbal, Numerical, Science, Social Studies, Total) for a 
subsample of 84 of the students who had completed these tests. None 
of the correlations between the Pharmacist scales and the academic 
tests was statistically significant (p > .05). The effectiveness of the 
Pharmacist scales in predicting graduation status could not be attrib- 
uted to the academic ability of the students. 

The strength of the relationship between the interest scores and 
graduation status for each sex was determined by the point biserial 
correlation coefficients reported in the last column of Table 2. All the 
correlation coefficients were relatively low. The highest coefficient of 
.28 accounted for only 8% of the variance in the criterion. Although
the interest scores should be helpful in discussing educational and
vocational plans with prospective students, they should not be used as
a basis for decision-making without additional data. 
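
As a rough check on the size of these coefficients, a point biserial correlation can be recovered from group means, the total standard deviation, and the group sizes. The sketch below applies the textbook formula to the summary statistics reported for the female students on the women's scale (graduate mean 46.3, nongraduate mean 40.0, total SD 8.8, 62 graduates and 14 nongraduates). The function name is hypothetical, and the formula assumes the total SD is the population form, so the result is only approximate.

```python
import math

def point_biserial(mean_1, mean_0, sd_total, n_1, n_0):
    """Point biserial r from two group means, the total SD, and group sizes."""
    n = n_1 + n_0
    p = n_1 / n  # proportion in the group coded 1 (graduates)
    q = n_0 / n  # proportion in the group coded 0 (nongraduates)
    return (mean_1 - mean_0) / sd_total * math.sqrt(p * q)

# Female students, women's scale, using the reported summary statistics.
r_pb = point_biserial(46.3, 40.0, 8.8, 62, 14)
variance_share = r_pb ** 2  # proportion of criterion variance accounted for
```

The result is close to the highest coefficient cited in the text, and its square is close to the 8% of variance mentioned there.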

Neither one of the two multiple correlation coefficients between the 
weighted composites of the scales and graduation status for either 
sample of male or female students was significantly greater than was 
the highest zero-order correlation. The women’s scale yielded the 
highest correlations with the criterion for both samples of men and 
women. The men’s scales did not add significantly to the prediction 
based solely on the women’s scale. 

The sexes differed significantly (p < .05) in their scores on both of
the men’s scales, but not the women’s scale. As shown in Table 2, the
mean score for the men was more than one-half standard deviation
(6.6 standard scores) higher than was the mean score for women on
the new men’s scale. The male students apparently endorsed more
business interests than did the female students. Separate sex norms
should be used for the men’s scales; however, combined sex norms are
permissible for the women’s scale.

The mean scores on the women's scale were considerably higher (9 to
14 standard score points) for both sexes than were the mean scores on
the men's scales.

None of the interactions between sex and graduation status
approached significance. The men's scale did not predict more
effectively for men than for women, nor did the women’s scale predict
more effectively for women than for men.


TABLE 1
Two-way Analysis of Variance of Scores on SVIB-M Pharmacist Scales

                                       F ratios
Source of                  Original       Revised
variation           df     men's scale    men's scale    Women's scale
Sex (A)              1       5.24*         14.67***         2.58
Graduation
  status (B)         1       5.91*          8.51**          ...
A X B                1       ...             .96             .51
Error              351

*p < .05.
**p < .01.
***p < .001.


TABLE 2
Comparison of Mean Standard Scores on SVIB Pharmacist Scales Based on
Sex and Graduation Status

                 Graduates      Nongraduates       Total
Sex             Mean    SD      Mean    SD      Mean    SD     r_pb
Original men's scale
  Male          38.9    8.9     36.1   10.7     38.4    9.3    .11
  Female        36.3    9.1     31.6   10.4     35.4    9.5    .19
  Total         38.4    9.0     35.1   10.7
Revised men's scale
  Male          38.7   12.0     34.7   14.0     38.0   12.4    ...
  Female        32.9   10.8     24.9   13.6     31.4   11.7    .27*
  Total         37.4   12.0     32.5   14.4
Women's scale
  Male          48.0    8.3     43.8   11.0     47.3    8.9    .18**
  Female        46.3    8.4     40.0    9.0     45.2    8.8    .28*
  Total         47.7    8.3     42.9   10.6

Note. n = 231 male graduates, 48 male nongraduates, 62 female graduates, 14 female nongraduates.
*p < .05.
**p < .01.


USE OF SELECTED FACTORS AS PREDICTORS OF SUCCESS
IN COMPLETING A SECONDARY TEACHER PREPARATION
PROGRAM


FRANK P. BELCASTRO 


Merrimack College 
North Andover, Massachusetts 


A multiple regression analysis was performed on all EPPS and 
SVIB scales to determine those scales which could be used to predict 
success in a secondary teacher preparation program for 207 females 
and 88 males. Significant discriminant function equations for four- 
teen male predictor variables and for eight female predictor variables 
were obtained. Thus, for both males and females it was possible to 
discriminate between those who had completed the teacher prepara- 
tion program and those who had not; for applicants it was possible 
to classify them as likely or unlikely to complete the teacher prepara- 
tion program. Cross-validation studies, however, with larger samples 
would be needed to establish generalizable equations that would 
permit realization of a comparatively high degree of accuracy of 
classification of applicant members in new samples. 


A major problem of teacher preparation institutions is the selection, 
from among prospective candidates, of those most likely to be success- 
ful in completing a teacher preparation program. 

A review of the literature reveals that an impressive number of 
factors have been investigated in the hope of producing a device which 
would identify successful teachers in advance of training. These factors 
or variables were largely achievement tests or were measures of stu- 
dent characteristics (Ayers and Rohr, 1974; Chabassol, 1968; Cook, 
1964; Darrow, 1962; Michael, Jones, Gettinger, Hodges, Kolesnik, 
and Seppala, 1961; Robinson, 1962). This study examined Edwards 
Personal Preference Schedule (EPPS) and Strong Vocational Interest 
Blank (SVIB) scales as predictors of success.


The Problem


Normally, application for admittance into the secondary teacher 
preparation program at Merrimack College, North Andover, Mas- 
sachusetts is made at the end of the sophomore year. Required for 
admittance are (a) 2.0 (out of 4.0) overall grade point average, (b) 2.0 
grade point average in a major, (c) record of scores on file from a 
group administration of the EPPS and SVIB measures, and (d) an 
interview. After verification that the applicant has met the grade point 
requirements and after exploration of the motivations of the applicant 
behind his vocational and educational aspirations, philosophy of edu- 
cation, and previous experience with youth, the admittance decision is 
made mutually by the student and an education department faculty 
member at the end of the interview. It would be an aid to both the 
applicant and the education faculty member in arriving at the admit- 
tance decision if some prediction could be made from selected EPPS 
and SVIB scales as to the applicant's successful completion of the 
teacher preparation program. 

The purpose of this study was to determine the usefulness of selected 
EPPS and SVIB scales as predictors of success in completing a second- 
ary teacher preparation program. 


Procedure 


Subjects selected for the study were four graduating classes of
seniors (N = 295) who had been admitted to the teacher preparation
program two years previously after meeting all requirements for
admission to the program.

Part of these graduating seniors (N = 211) successfully completed the
teacher training program including a successful student teaching
experience; for a variety of reasons the other graduating seniors
(N = 84) transferred out of the program during the two year period and
thus did not complete the program. They also did not have student
teaching experience.

The dependent or criterion variable used in this study was defined in 
terms of whether the graduating senior did or did not complete the 
teacher training program. Using this criterion of completion versus 
noncompletion, the investigator endeavored to determine whether the 
predictor variables could discriminate between these two groups. 

The independent or predictor variables consisted of scores on
subscales of the SVIB and the EPPS, overall cumulative point average,
IQ, and age. All of these predictor variables (51 for the females, 79
for the males) were obtained at the end of the sophomore year at the
time of formal entrance into the teacher preparation program.

Scores were analyzed separately for males and females, since the
SVIB contained different forms for each sex and since the amount of 
overlapping between forms was limited. For the completion and non- 
completion females, the frequencies were 154 and 53, respectively; for 
the completion and noncompletion males, the numbers were 57 and 
31, respectively. 

For each of the separate samples of males and females, the null 
hypothesis to be tested was that there would be no relationship be- 
tween membership in either one of the two groups (completed or non- 
completed) and the composite of selected predictor variables. 

The Stepwise Multiple Regression (BMD02R) analysis was performed
for the males and females on the IBM 360 at the California
State University, San Diego.


Data Analysis 


To test the general null hypothesis, discriminant equations were 
computed in a step-wise manner. For the males and females sepa- 
rately, the computer first selected the one “best” independent variable 
and then selected the “best” of the remaining variables from the pool 
of predictor variables. This step-wise procedure continued until the 
entrance of any other predictor variable to the composite did not 
contribute significantly to the prediction scheme. Only 14 independent 
variables were chosen for the males and 8 for the females from the 
pool of predictor variables, since the addition of remaining variables 
in both cases did not significantly increase predictive effectiveness. 
These predictor variables are presented in Table 1. 
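
The step-wise selection described above can be sketched in a few lines. This is only an illustration of forward selection with a fixed F-to-enter threshold, not the BMD program itself: the function names, the threshold of 4.0, and the simulated data are all assumptions introduced here.

```python
import random


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x


def rss(X, y, cols):
    """Residual sum of squares of y regressed on an intercept plus the chosen columns."""
    rows = [[1.0] + [row[c] for c in cols] for row in X]
    k = len(cols) + 1
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    b = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve(A, b)
    return sum((yi - sum(bi * ri for bi, ri in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))


def forward_stepwise(X, y, f_to_enter=4.0):
    """Add the best remaining predictor until none reaches the F-to-enter threshold."""
    n, p = len(y), len(X[0])
    chosen = []
    current = rss(X, y, chosen)
    while len(chosen) < p:
        best = None
        for c in range(p):
            if c in chosen:
                continue
            new = rss(X, y, chosen + [c])
            df = n - len(chosen) - 2  # residual df of the enlarged model
            f = (current - new) / (new / df)
            if best is None or f > best[0]:
                best = (f, c, new)
        if best[0] < f_to_enter:
            break
        chosen.append(best[1])
        current = best[2]
    return chosen


# Simulated data (hypothetical): column 0 drives the dichotomous criterion,
# columns 1 and 2 are pure noise.
random.seed(0)
X = [[random.gauss(0, 1) for _ in range(3)] for _ in range(200)]
y = [1.0 if row[0] > 0 else 0.0 for row in X]
chosen = forward_stepwise(X, y)
```

With a dichotomous criterion coded 0/1, as here, the regression weights are proportional to those of a two-group linear discriminant function, which is why the text can speak of regression and discriminant equations interchangeably.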


TABLE 1
Predictor Variables Which Contributed Significantly to the
Discriminant Analysis


DEMOGRAPHIC:                    SVIB:
 1. Grade Point Average           9. Architect
                                 10. Physician
EPPS:                            11. Veterinarian
 2. Autonomy                     12. Printer
 3. Affiliation                  13. Industrial Arts Teacher
 4. [illegible]                  14. Physical Therapist
 5. Nurturance                   15. [illegible]
 6. Change                       16. Librarian
 7. Heterosexuality              17. English Teacher
 8. Consistency                  18. Life Insurance Salesman
                                 19. Femininity-Masculinity Scale
                                     Occupational Level
An inspection of Table 2 reveals that both male and female predictor
combinations were significant at the .01 level; hence the null
hypothesis was rejected. On the basis of the variables selected for
both males and females it would be possible to discriminate between
those who had completed the teacher preparation program and those who
had not.

Although the multiple correlation coefficients indicated the degree 
to which it would be possible to predict the dichotomous criterion, 
they did not yield the kind of information that both the education 
faculty member and applicant could use directly to arrive at the admit- 
tance decision. To facilitate the application of the foregoing analysis, 
the discriminant function equations for all 14 male predictor variables 
and all 8 female predictor variables were obtained. The generalized 
optimizing discriminant equation for the males was: 


[Variable subscripts are illegible in the source; unclear signs are shown as [?].]

Y = .22285X - .04216X - .01716X - .03183X
    [?] .01630X - .01776X + .02169X + .03108X
    [?] .01985X + .01892X - .02657X - .00959X
    + .01161X - .01931X + 2.77798.

The corresponding equation for the females was:

Y = .21174X + .01762X - .03751X - .01566X
    + .01976X + .00062X - .00961X + .00574X + .30417.


To predict the group into which the scores of an applicant would
most likely place him, one can substitute into the discriminant
equation the actual raw score values for each of the EPPS variables
and the report form score values for each of the SVIB variables.
Through one's use of the solution to the equation, a male or female
applicant could be
classified with those likely to complete the teacher preparation pro- 


TABLE 2
Multiple Regression Analyses for Predictor Variables and Criterion Variable
(Completion vs. Noncompletion) for Males and Females


Predictor Combination                        R      R*     SE      F       P

MALES   (14 predictors; subscripts
         illegible)                         .656   .572   .396   3.941   <.01
FEMALES (8 predictors; subscripts
         illegible)                         .474   .441   .395   7.169   <.01


R* = shrunken multiple R.
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gram if the resulting score was .65 (mean of the male scores) or higher 
for the male and .74 (mean of the female scores) or higher for the 
female, and with those unlikely to complete the program if the result- 
ing score was less than the appropriate one of these mean cutoff scores. 
The greater the deviation from the appropriate mean score, the greater 
the probability that the applicant's data would more closely parallel 
those of the particular criterion group. 
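
As a minimal sketch of the classification rule just described, the code below scores an applicant and compares the result with the sex-specific mean cutoff (.65 for males, .74 for females, as reported in the text). The function names are introduced here for illustration, and the weights in the usage lines are hypothetical, since the subscripts of the published equations cannot be recovered.

```python
# Mean cutoff scores reported in the text.
CUTOFFS = {"male": 0.65, "female": 0.74}


def discriminant_score(weights, constant, values):
    """Weighted sum of predictor values plus the additive constant."""
    return constant + sum(w * x for w, x in zip(weights, values))


def classify(score, sex):
    """Assign the applicant to the group whose side of the cutoff the score falls on."""
    if score >= CUTOFFS[sex]:
        return "likely to complete"
    return "unlikely to complete"


# Hypothetical two-predictor equation, for illustration only.
example_score = discriminant_score([0.2, -0.01], 0.3, [3.0, 10.0])
example_group = classify(example_score, "female")
```

The distance of the score from the cutoff can be reported alongside the label, since the text notes that larger deviations indicate a closer resemblance to the criterion group.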

The efficiency of the equations was measured by obtaining a percent- 
age composed of the ratio of students who were predicted to complete 
or not to complete the program to students who actually did or did not 
complete the program. The efficiency of the predictive equations fol- 
lows: (1) Males—for the total group, the prediction was 78% correct; 
of those who completed the program, 74% were correctly predicted as 
successfully completing the program; of those who did not complete 
the program, 87% were correctly predicted as not completing the 
program. (2) Females—for the total group, 66% correct; of those who 
completed the program, 63% correct; of those who did not complete 
the program, 75% correct. 
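
The total-group figures can be checked against the within-group figures, because the overall hit rate is the count-weighted average of the two group rates. The sketch below uses the group sizes given earlier (57 and 31 males, 154 and 53 females); the hit counts are approximations recovered by rounding, not values reported by the author.

```python
def approximate_hits(rate, n):
    """Recover an approximate count of correct classifications from a rounded rate."""
    return round(rate * n)


# (reported proportion correct, group size) for each criterion group.
groups = {
    "males": {"completed": (0.74, 57), "not completed": (0.87, 31)},
    "females": {"completed": (0.63, 154), "not completed": (0.75, 53)},
}


def overall_rate(group):
    """Total correct classifications divided by total students."""
    hits = sum(approximate_hits(rate, n) for rate, n in group.values())
    total = sum(n for _, n in group.values())
    return hits / total


overall = {sex: overall_rate(group) for sex, group in groups.items()}
```

Under these assumptions the recovered overall rates agree with the reported 78% for males and 66% for females.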

As a further improvement, the efficiency of the equations was ob- 
tained as just explained except that only the top 25% and the bottom 
25% of the students were used. This improved efficiency of the predic- 
tive equations follows: (1) Males—for the total group, 89% correct; of 
those who completed the program, 88% correct; for those who did not 
complete the program, 91% correct. (2) Females—for the total group, 
74% correct; of those who completed the program, 67% correct; of
those who did not complete the program, 90% correct. Cutoff scores
for this group were .8587 for the high and .4365 for the low subgroup
of males; .8778 for the high and .6000 for the low subgroups of
females.


Conclusions 


For both males and females it was possible to discriminate between
those who had completed the secondary teacher preparation program
and those who had not.

For both male and female applicants, it was possible to classify
them as likely or unlikely to complete the teacher preparation
program.

For both males and females, applicants could be classified with
greater accuracy as unlikely to complete the teacher preparation
program than as likely to do so.

Before these prediction equations could be used to assist education
faculty members and applicants in arriving at the admittance decision,
it would be necessary to carry out cross-validation studies with large
samples to ascertain whether the same predictor variables would be 
selected and whether the weights assigned to them would remain 
relatively stable. Since there were so many predictor variables in rela- 
tion to the number of individuals studied and hence a comparatively 
small number of degrees of freedom, the larger samples would permit 
a more nearly realistic estimate of the correlation and an improvement
in accuracy of classification when the equations are applied to new 
samples. Since this study furnished evidence about prediction of suc- 
cess for entering students in only one selective teacher preparation 
program at only one institution, generalizations at best would be 
highly tentative. 
THE PREDICTION OF PERFORMANCE IN AN
EDUCATIONAL PSYCHOLOGY MASTER'S DEGREE 
PROGRAM 


ANDREW BEAN 
Temple University 


This study examined the predictive validity of the Graduate Rec- 
ord Examinations Aptitude Test, Verbal and Quantitative scores 
(GREV and GREQ) and undergraduate grade-point average
(UGPA). Criterion variables consisted of graduate grade-point
average (GGPA), the Master's Comprehensive Examination scores
(MCE), and grades in individual required courses. Subjects were 91 
students enrolled in the Department of Educational Psychology Mas- 
ter's degree program at a large metropolitan university. GREV corre- 
lated .31 with GGPA, but failed to correlate significantly with any 
other criterion. GREQ correlated .45 and .59 with grades in two 
research methods courses, but failed to correlate with any other 
criterion. UGPA was not significantly related to any of the criteria.
These somewhat atypical findings stress the need for local validation 
of graduate admissions measures. 


THE predictive validity of measures used for admission to graduate
school can vary considerably as a function of the specific criterion
used. In many graduate programs, students are required to achieve a
minimum grade-point average, to obtain minimum grades in individual
courses, and to pass a comprehensive examination in order to be
awarded a degree. Thus, the validity of admissions measures needs to
be examined using other criteria in addition to the usual graduate
grade-point average (GGPA).

Predictive validity studies of the Graduate Record Examinations 
Aptitude Test, Verbal and Quantitative scores (GREV and GREQ) 
typically have used GGPA as the single criterion of academic perform- 
ance. Recent studies conducted within colleges of education have 
reported correlations between GREV and GGPA ranging from the
.20's to the .30's; correlations between GREQ and GGPA have ranged
from low and nonsignificant to as high as .37 (Borg, 1963; Madaus and
Walsh, 1965; Payne, Wells, and Clark, 1971).

The predictive validity of undergraduate grade-point average
(UGPA) has also been investigated through using GGPA as the
criterion. In a representative study, Ayres (1971) found correlations
between UGPA and GGPA ranging from .28 to .69 in three different
curricular groups within a college of education.

The purpose of this study was to determine the predictive validity of
GREV, GREQ, and UGPA for a variety of academic performance
criteria in the Department of Educational Psychology Master's degree
program at a large metropolitan university. The criteria included
grades in individual courses, comprehensive examination performance,
and cumulative graduate grade-point average.


Variables 


The predictor variables were GREV, GREQ, and UGPA. The criterion
variables were as follows: (1) graduate grade-point average
(GGPA); (2) Master's Comprehensive Examination score (MCE); and
(3) individual required core course grades in Group Processes, Human
Development, Learning Theories, Research Procedures, and Survey
Research. UGPA, GGPA, and individual course grades were computed
using the usual four-point system (A = 4, B = 3, C = 2, D = 1,
E = 0). MCE scores were obtained from essay responses to questions
based on the content of the core courses just specified.


Sample 


The subjects were drawn from 91 students enrolled in a General
Educational Psychology Master's degree program in a large urban
university. These students took the MCE for the first time between
Fall 1969 and Spring 1971.


Results 


Means and standard deviations for predictor and criterion variables
are given in Table 1. Correlations among the predictors and criteria
are presented in Table 2. Sample sizes for each correlation are cited in
parentheses adjacent to the correlation coefficient.

As indicated in Table 2, GREV correlated .31 with graduate grade-
point average; GREQ and UGPA were uncorrelated with GGPA. A
TABLE 1
Means and Standard Deviations for Predictor and Criterion Variables

                                       Standard
                            Mean       Deviation       N

Predictor variables:
  GREV                      571          83.40         60
  GREQ                      520          95.30         60
  UGPA                      2.78           .34         86

Criterion variables:
  GGPA                      3.41           .27         91
  Group Processes           3.58           .50         66
  Human Development         3.33           .93         75
  Learning Theories         3.28           .53         75
  Research Procedures       2.95           .43         62
  Survey Research           3.35       [illegible]     31
  MCE                      49.60          6.81         91


stepwise regression analysis, undertaken to determine whether some
optimal weighted combination of GREV, GREQ, and UGPA would
more accurately predict GGPA than would the GREV alone,
indicated that combining GREQ and UGPA with GREV failed
significantly to increase the predictability of graduate grade-point
average. Total GRE score (a simple sum of GREV and GREQ)
correlated .25 with GGPA. It should be noted that this coefficient is
lower than the .31 correlation coefficient obtained from using GREV
alone. Thus, GREQ was not a useful predictor of graduate grade-point
average when used alone or in combination with GREV.
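
The drop from .31 to .25 follows from the standard formula for the correlation of a sum of two scores with a criterion. The sketch below applies that formula to the standard deviations in Table 1. Two inputs are assumptions made here: the GREQ-GGPA correlation of .10 is read from a poorly legible table entry, and the GREV-GREQ correlation of .25 is a purely hypothetical value, since the article does not report it.

```python
from math import sqrt


def composite_correlation(r1c, r2c, r12, sd1, sd2):
    """Correlation of the unweighted sum (x1 + x2) with a criterion c."""
    numerator = r1c * sd1 + r2c * sd2
    denominator = sqrt(sd1 ** 2 + sd2 ** 2 + 2 * r12 * sd1 * sd2)
    return numerator / denominator


r_total = composite_correlation(
    r1c=0.31,   # GREV with GGPA (reported)
    r2c=0.10,   # GREQ with GGPA (assumed reading of Table 2)
    r12=0.25,   # GREV with GREQ (hypothetical)
    sd1=83.40,  # SD of GREV (Table 1)
    sd2=95.30,  # SD of GREQ (Table 1)
)
```

Because GREQ has the larger standard deviation, it dominates the simple sum, so adding a weak predictor with a large spread pulls the composite's validity below that of GREV alone.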

GREQ correlated .45 with grades in Research Procedures and .59 with
grades in Survey Research. The content of these two courses included
an introduction to elementary statistical methods. Thus, GREQ was a
useful predictor of performance only in courses with a strong
quantitative emphasis.


TABLE 2
Correlations among Predictor and Criterion Variables(a)


                                   Predictor Variables
Criterion                   GREV          GREQ          UGPA

Graduate GPA                .31*          .10           .05
Group Processes             .02 (40)     -.02 (40)     [illegible]
Human Development           .09 (46)     [illegible]   [illegible]
Learning Theories           .04          [illegible]   [illegible]
Research Procedures        -.07           .45**        [illegible]
Survey Research             .10 (24)      .59**        [illegible]
MCE                         .19          -.13          [illegible]


*p < .05.
**p < .01.
(a) Sample sizes specified in parentheses.
GREV, GREQ, and UGPA were not significantly correlated with
MCE score. However, GGPA correlated .59 with MCE scores. As 
would be expected, a student's performance on the MCE was much 
more closely related to the grades he received in courses than to 
measures of his academic aptitude obtained prior to admission. 


Discussion 


The correlation obtained between GREV and GGPA was typical of 
those reported by other investigators (Borg, 1963; Madaus and Walsh, 
1965; Payne et al., 1971). However, the failure of UGPA to predict any 
criterion of academic performance was somewhat surprising in light of 
the findings of other investigators (Ayres, 1971; Payne, et al., 1971). 
Restricted range in UGPA did not appear to limit predictability, since 
UGPA scores ranged from 2.0 to 3.6. No evidence of curvilinearity 
was found. 

One possible explanation of the low correlation between UGPA and 
GGPA is that the Department of Educational Psychology Master's 
degree students came from an unusually wide variety of undergraduate 
colleges and curricula. The meaning of a “3.00” average varied de- 
pending upon the college and curriculum in which it was earned. Thus, 
the noncomparability of undergraduate records may have made them 
worthless as predictors in this case. 

Although GREV was a significant predictor of GGPA when used 
alone, prediction was not improved by forming a weighted composite 
of GREV and GREQ. GREQ was shown to be a useful predictor of 
grades in research courses, whereas GREV was not. Thus, the data 
indicate that high quantitative ability did not compensate for low 
verbal ability in predicting total graduate grade-point average;
similarly, high verbal ability did not compensate for low quantitative
ability in predicting performance in research courses. Prediction using
a weighted composite, as in multiple regression, rests on such a com- 
pensation hypothesis. If performance in both total GGPA and individ- 
ual research courses is considered important, minimum scores for 
admission should be set separately for GREV and GREQ, rather than 
using some weighted composite. 

A practice commonly followed in making admission decisions is to
set a minimum GRE total aptitude score, instead of using GREV and
GREQ separately. For the data in this sample, such a practice actually
reduced the predictive validity from that obtained in using GREV
alone. Thus, the validity of the GRE total aptitude score should be
checked empirically, rather than simply assumed.

The lack of predictive validity shown by UGPA in this sample
further emphasizes the necessity for local validation of admissions
measures. The fact that a measure has shown predictive validity in a
number of settings is no guarantee of validity in a particular location.
THE RELATIONSHIP OF THE WATSON-GLASER CRITICAL
THINKING APPRAISAL TO SEX AND FOUR SELECTED
PERSONALITY MEASURES FOR A SAMPLE OF DUTCH
FIRST-YEAR PSYCHOLOGY STUDENTS


JOH. HOOGSTRATEN AND H. H. C. M. CHRISTIAANS
University of Amsterdam


In reaction to an earlier publication by Simon and Ward (1974) on 
the 1952 version of the Watson-Glaser Critical Thinking Appraisal, 
data are presented on the relationship of the 1964 forms of the same 
instrument to four selected noncognitive measures for a sample of 
190 Dutch psychology students. Except for Subtest 5, Evaluation of 
Arguments, subtest and total score means were significantly lower 
for Form ZM than for Form YM. Reliabilities of the Watson-Glaser 
(W-G) subtests ranged from only .22 to .69. Total score KR-20 
reliability estimates, however, were .72 (ZM) and .77 (YM).

No sex differences were found. The correlation between the W-G
total scores and those on the extroversion-introversion measure was
not significant. Correlations of the W-G measure with other person- 
ality characteristics (neuroticism and rigidity) were also close to zero. 
As for version ZM of the W-G measure the performance was signifi- 
cantly associated with test-defensiveness. 


THE present research was undertaken to study the concurrent validity
of the Watson-Glaser Critical Thinking Appraisal (Watson and Glaser,
1964, Forms YM and ZM), relative to five affective measures with
Dutch male and female first-year psychology students. To achieve this
objective, the results obtained with Dutch students were compared
with those found by Simon and Ward (1974) with British students.
Simon and Ward studied the relationship between scores on the 1952
version of the Watson-Glaser and (a) sex and (b) scores on a measure
of introversion-extroversion. In the current study, the association of
the test with measures of neuroticism, rigidity, and test-defensiveness
was also determined. A negative relation between the Critical Think-

Subtest intercorrelations, subtest-total scale correlations, and
coefficients (KR-20) of both the YM and ZM form were


TABLE 1
Means and Standard Deviations of Scores of Psychology Students on the Watson-Glaser
Critical Thinking Appraisal, Forms YM and ZM

[Table body illegible. Column headings: Subtest and Total scale; Maximum
Score; Means (YM, ZM); SDs (YM, ZM); Student's t. Legible sample sizes:
n = 96, n = 97.]
Procedure 


Ninety-seven male and 92 female first-year psychology students of
the University of Amsterdam, about two-thirds of all the psychology

female, furthermore, completed personality questionnaires on rigidity
([author illegible], 1968), Neuroticism (N), Neuroticism Manifested by
Complaints (NS), Extroversion-Introversion (E), and Test-Taking
Attitude (T). These four last-mentioned measures were as-

TABLE 2
Reliability (KR-20) Estimates of the Watson-Glaser Forms YM and ZM


Subtest       1       2       3       4       5       Total

[Cell values illegible.]



TABLE 3
Subtest Intercorrelations and Subtest-Total Correlations for Forms
YM and ZM of the Watson-Glaser


                                    1         2         3         4         5
                                 YM  ZM    YM  ZM    YM  ZM    YM  ZM    YM  ZM

Recognition of Assumptions
[label illegible]
Interpretation
Evaluation of Arguments
Total

[Cell values illegible.]


sessed by the Amsterdamse Biografische Vragenlijst, the Dutch
version of Eysenck's M.P.I. (Wilde, 1963).


Results 


The results of univariate t tests indicate that for the sample studied
Form YM was less difficult than was Form ZM. The data are given in
Table 1: the means, standard deviations, and t ratios of the two
samples (Forms YM and ZM) of the five subtests, as well as the total
scores. The difference between the means of the total scores was
significant at the .001 level. The differences between the means of the
subtests were also significant. The only exception to the general
finding that Form ZM was more difficult than was Form YM was
Subtest 5 (Evaluation of Arguments). The difference, however, was not
substantial (.63), though significant (p < .05). On the whole, the
subtest reliabilities shown in Table 2 were low in that they ranged
from .22 to .69. For the total scales, however, higher KR-20 values of
.77 (YM) and .72 (ZM) were obtained. As for subtest intercorrelations
and subtest-total scale correlations, which are given in Table 3, it is
clear that the correlations were low.

Subtest-total scale correlation coefficients were more substantial
than were the intercorrelations of the subtests, as the coefficients
varied from .43 to .78, though the obtained values were overestimates
because the subtests are, of course, parts of the total scales.
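
The size of this overestimate can be gauged with the usual part-whole correction, which gives the correlation between a subtest and the total of the remaining subtests. The sketch below is illustrative only: the authors do not report applying such a correction, and the numerical inputs are hypothetical.

```python
from math import sqrt


def corrected_part_whole(r_it, sd_total, sd_part):
    """Correlation of a part with the total after removing the part from the total."""
    numerator = r_it * sd_total - sd_part
    denominator = sqrt(sd_total ** 2 + sd_part ** 2 - 2 * r_it * sd_total * sd_part)
    return numerator / denominator


# Hypothetical values: uncorrected r = .60, total-score SD = 8, subtest SD = 3.
r_uncorrected = 0.60
r_corrected = corrected_part_whole(r_uncorrected, sd_total=8.0, sd_part=3.0)
```

With these inputs the corrected coefficient falls well below the uncorrected one, which illustrates why part-whole correlations cannot be compared directly with subtest intercorrelations.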

Correlations between the Watson-Glaser measure and each of the 
four personality characteristics are presented in Table 4. With one 
exception, the measure of test-defensiveness on the ZM form, no 
significant values were obtained. Moreover, the coefficients of
correlation of the total scales with rigidity were low and also not
significant: r = .10 (YM), r = -.01 (ZM). Finally, the mean differences
between the total scores of male and female subjects did not reach
statistical significance. Men obtained means of 76.73 (YM) and 71.53
(ZM); women 76.51 (YM) and 70.26 (ZM).


Discussion 


Supporting the data reported by Simon and Ward with British
students, the results with the Dutch students suggest that sex is not a
significant factor in performance on the Watson-Glaser Critical
Thinking Appraisal, Forms YM and ZM. The data also indicate that
performance on these equivalent critical thinking instruments is
probably independent of two personality characteristics, neuroticism
and rigidity. Furthermore, the results confirm the conclusion reached
by Simon and Ward on the 1952 version that there is no relation
between the Watson-Glaser measure and the Extroversion-Introversion
scale. It was shown that there was a significant negative relation
between the critical thinking performance and test-defensiveness,
though this result held only for Form ZM and the correlation
accounted for no more than about 10% of the variance.

As can be seen from Table 1, Form ZM seems to be more difficult
than Form YM. This result upholds the position taken in the manual
of the Watson-Glaser Critical Thinking Appraisal, which states that
separate percentile norms hold for Forms YM and ZM.

The manual furthermore reports low subtest intercorrelation
coefficients. The data support the idea that these subtests measure
relatively distinct abilities. This result can be at least partly
explained, however, by the moderately low reliability coefficients of
the five subtests of both forms.
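
For readers unfamiliar with the reliability index used here, the following is a minimal sketch of the KR-20 computation for dichotomously scored items. The function name and the tiny response matrix are hypothetical; the variance is computed with the number of examinees in the denominator, which is one common convention.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for a list of 0/1 item-response rows."""
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    variance_total = sum((t - mean_total) ** 2 for t in totals) / n
    sum_pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n  # proportion passing item j
        sum_pq += p * (1 - p)
    return (k / (k - 1)) * (1 - sum_pq / variance_total)


# Hypothetical data: four examinees, three items.
example = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
example_reliability = kr20(example)
```

Because the coefficient grows with the number of items, short subtests tend to show lower KR-20 values than the full scale, which is consistent with the pattern reported above.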


Though overestimated, the total scales reach more nearly acceptable
reliability levels than the subtests do.
CONVERGENT AND DISCRIMINANT VALIDITIES OF TWO
SETS OF MEASURES OF SPATIAL ORIENTATION AND
VISUALIZATION


LEWIS PRICE AND JOHN ELIOT 
University of Maryland 


Responses from high school sophomores to the Eliot-Price and the 
Guilford-Zimmerman tests of spatial orientation and visualization 
were evaluated in terms of the multitrait-multimethod matrix tech- 
nique. The Eliot-Price tests were found to meet all criteria for con- 
vergent and discriminant validity. However, the Guilford-Zimmer- 
man tests did not meet one of the criteria for discriminant validity. 
The Eliot-Price tests appeared to be more nearly precise measures of 
spatial orientation and visualization. 


BORICH and Bauman (1972) examined the convergent and discriminant
validities of the French (1962) and the Guilford-Zimmerman
(1956) tests of spatial orientation and visualization. Using the
multitrait-multimethod matrix technique (Campbell and Fiske, 1959),
they found that the variance attributed to method exceeded the
variance attributed to traits, and concluded that although there was
evidence for convergent validity, there was little evidence for
discriminant validity (Table 1).


Purpose and Procedure 


The purpose of this research was to make a similar comparison of
the Eliot-Price and the Guilford-Zimmerman tests of spatial
orientation and visualization. Price (1974) described the Eliot-Price
tests in detail, and also showed that the pair of tests correlated
highly with other putative measures of the same abilities. In the
present study, the two sets of tests were given to 39 randomly
selected high school
TABLE 1
Multitrait-Multimethod Matrix for French and
Guilford-Zimmerman SR-O and Vz Tests


                                        Method
                           Guilford-Zimmerman            French
Method         Trait        SR-O        Vz          SR-O        Vz

Guilford-
Zimmerman      SR-O        (.88)a
               Vz           .67       (.93)

French         SR-O     [illegible] [illegible]    (.66)
               Vz       [illegible] [illegible]     .55       (.51)


Note. All correlation coefficients were significant beyond the .05 level.

a Alternate forms reliability reported by Guilford and Zimmerman (1956).
b Kuder-Richardson 21 reliability reported by Guilford-Zimmerman (1956).
c Alternate forms reliability reported by the authors.


sophomores, and their responses were evaluated in terms of the multi- 
trait-multimethod matrix technique (Table 2). 


The values in the diagonal, representing the convergent validity data 
from this study, were higher than were those obtained by Borich and 
Bauman (.60 and .62 compared with .48 and .44 respectively). The low 
reliabilities of the French tests might account for the low convergent 
validity coefficients in Table 1. Indeed, the French tests might not be 




TABLE 2
Multitrait-Multimethod Matrix for Eliot-Price and Guilford-Zimmerman SR-O
and Vz Tests


                                        Method
                           Guilford-Zimmerman          Eliot-Price
Method         Trait        SR-O        Vz          SR-O        Vz

Guilford-
Zimmerman      SR-O        (.88)
               Vz           .71       (.93)

Eliot-Price    SR-O     [illegible] [illegible]    (.80)
               Vz       [illegible] [illegible]     .52       (.93)


Note. All correlation coefficients were significant beyond the .05 level.

a ... reported by Guilford-Zimmerman (1956).
b Coefficient alpha reported by authors.
c Coefficient alpha reported by Price (1974).


Results and Discussion
as stable measures of spatial orientation and visualization as those of
Guilford-Zimmerman and Eliot-Price. 

The correlations (.60 and .62) between independent methods (differ- 
ent authored tests) measuring the same intended traits of SR-O and Vz 
exceeded the correlations (of .51 and .52) between either one of them 
and other traits not having the method (same authored tests) in com- 
mon. Thus, the criterion of discriminant validity was met satisfac- 
torily. Similarly, the correlations (.60 and .62) between independent 
methods (different authored tests) measuring the same intended trait 
exceeded the correlation of .52 between different traits of SR-O and Vz 
which employed the same method or same authored test by Eliot and 
Price. However, in the instance of the Guilford-Zimmerman tests, the 
criterion of discriminant validity was not met, as the correlation of .71 
between the different traits of SR-O and Vz exceeded the correlation of 
.60 and .62 between the different authored tests on the respective same
traits of SR-O and Vz. In this latter case, the criterion of discriminant 
validity was more satisfactorily met for the Eliot-Price tests than for 
the Guilford-Zimmerman tests. It appears that the Guilford-Zimmer- 
man tests might have another factor in common which is indicated by 
the .71 correlation. The ability to understand complex verbal instruc- 
tions could conceivably be that other common factor. 
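
The comparisons in the preceding paragraph can be summarized as two checks in the spirit of Campbell and Fiske (1959). The sketch below uses only the coefficients quoted in the text; the variable and function names are introduced here for illustration, and the checks are a simplification of the full multitrait-multimethod criteria.

```python
# Same trait, different method (convergent validity coefficients).
validity = [0.60, 0.62]

# Different trait, different method.
heterotrait_heteromethod = [0.51, 0.52]

# Different trait, same method (SR-O with Vz within one test battery).
heterotrait_monomethod = {"Eliot-Price": 0.52, "Guilford-Zimmerman": 0.71}


def exceeds_heteromethod():
    """Validity values should exceed the different-trait, different-method values."""
    return min(validity) > max(heterotrait_heteromethod)


def exceeds_monomethod(method):
    """Validity values should exceed the different-trait, same-method value."""
    return min(validity) > heterotrait_monomethod[method]


results = {method: exceeds_monomethod(method) for method in heterotrait_monomethod}
```

On these figures the first check is satisfied, and the second is satisfied for the Eliot-Price tests but not for the Guilford-Zimmerman tests, which matches the conclusion drawn in the text.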

The Eliot-Price tests appeared to be more nearly precise measures of 
spatial orientation and visualization in that they met all criteria for 
convergent and discriminant validity. 
FACTORIAL DIMENSIONS OF THE JESNESS INVENTORY
WITH BLACK DELINQUENTS 


ROGER WOODBURY 


Wilson County Technical Institute 
Wilson, North Carolina 


JAMES SHURLING 
North Carolina State University 


The study identified the personality dimensions in the Jesness 
Inventory (JI) among black male delinquents. A random sample of 
250 black male delinquents was administered the JI. A principal 
components factor analysis with a varimax rotation identified three 
factors: (1) Self-Estrangement, (2) Social Isolation, and (3) Immatur- 
ity. The proportions of total common-factor variance accounted for 
by the three factors were .511, .286, and .203 respectively. The results
suggest that the factors might be a part of a larger alienation con- 
struct in black delinquents. 


TERMS such as hostility, aggressiveness, and sociopathic are often
used to label black delinquents. To date, little empirical evidence exists
which describes meaningful dimensions related to black delinquency.
In a study of black adolescents, Shuman and Hatchett (1974) concluded
that one of the most pervasive factors in black adolescents was
alienation.

The absence of data and the results of Shuman and Hatchett's
(1974) study appeared sufficiently promising to encourage further
investigation. The purpose of the study was to identify meaningful
personality dimensions among black delinquents within the Jesness
Inventory (JI) (Jesness, 1966) as a first step to establishing the
construct validity of the instrument.
Method


The subjects were 250 adjudicated, black delinquent males randomly
selected from the North Carolina Youth Development Schools.
The mean age was 14.6 years. All subjects were administered the JI in
the youth development schools.

The JI is a 155 item, true-false inventory purporting to measure 10 
personality characteristics of delinquents and deviant adolescents: (1) 
Social Maladjustment, (2) Value Orientation, (3) Immaturity, (4) Au- 
tism, (5) Alienation, (6) Manifest Aggression, (7) Withdrawal, (8) 
Social Anxiety, (9) Repression, and (10) Denial. Raw scores for each 
variable are converted to T scores. 

A principal components factor analysis was computed from the 
matrix of product moment correlations. For the analysis, the total 
variance was factor analyzed by placing unities in the diagonals. Fac- 
tors with eigenvalues greater than unity were extracted and rotated 
through a varimax solution. 


Results and Discussion 


The rotation of factors resulted in the extraction of three factors 
having corresponding eigenvalues of 4.09, 2.29, and 1.62. Major re- 
sults of the factor analysis are presented in Table 1. 
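
The proportions quoted for the three factors can be reproduced from the eigenvalues. The short check below assumes, as the analysis of ten scales with unities in the diagonal implies, that the total variance equals 10; the variable names are introduced here for illustration.

```python
eigenvalues = [4.09, 2.29, 1.62]  # reported for the three retained factors
n_scales = 10                     # total variance when unities are in the diagonal

extracted = sum(eigenvalues)

# Share of the variance extracted by the three factors together.
share_of_extracted = [ev / extracted for ev in eigenvalues]

# Share of the total variance of all ten scales.
share_of_total = [ev / n_scales for ev in eigenvalues]
```

The first set of shares matches the reported .511, .286, and .203, so these figures describe each factor's portion of the variance extracted by the three factors, which together account for about 80% of the total variance of the ten scales.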

The first factor was termed Self-Estrangement and the proportion of 
the total variance extracted was .511. Scales heavily loaded on Factor I 
seem to describe the black delinquent as possessing antisocial tenden- 
cies, alienated feelings toward others, distorted perceptions of reality, 
a dissociation between events and the person, and feelings of frustra- 
tion and anger. These behavioral characteristics in black delinquents 


TABLE 1
The Rotated Factor Matrix of the Jesness Inventory Scales


                                  Factor Loadings*

Scales                            I        II       III

[Column positions of the loadings are not recoverable; legible values are
listed after each scale.]

 1. Social Maladjustment          .46
 2. Value Orientation             [illegible]
 3. Immaturity                    .82
 4. Autism                        .85
 5. Alienation                    .86
 6. Manifest Aggression           [illegible], .48
 7. Withdrawal                    [illegible]
 8. Social Anxiety                [illegible], .46
 9. Repression                    .86, [illegible]
10. Denial                        .81


* Loadings less than .40 omitted. 




seem to be basic to attitudes of self-estrangement. The factor of Self- 
Estrangement may refer to black delinquents’ beliefs that they are not 
what they would like to be. 

The second factor was termed Social Isolation, and the proportion
of total variance extracted was .286. Inspection of the scales on Factor
II reveals that this dimension describes the black delinquent as
withdrawn and unhappy, angry, and disturbed over interpersonal
relationships. This factor may reflect black delinquents' feelings of
isolation from family and society.

The third factor was termed Immaturity, and the proportion of total 
variance extracted was .203. Although only two scales had high loadings
(Table 1) on Factor III, the dimension described black
delinquents as repressing feelings and beliefs normally experienced by 
people and as displaying attitudes about themselves and others nor- 
mally experienced by younger persons. 
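
The three reported proportions (.511, .286, and .203) agree with each eigenvalue divided by the sum of the three retained eigenvalues (8.00). Dividing instead by the number of scales (10, which is the total variance when unities are placed in the diagonal) gives smaller values. The sketch below makes both computations explicit; reading the reported figures as shares of the retained variance is an inference from the arithmetic, not a statement made in the source.

```python
# Eigenvalues reported for the three rotated factors.
eigenvalues = {"I": 4.09, "II": 2.29, "III": 1.62}

retained_total = sum(eigenvalues.values())   # variance carried by the 3 factors
n_scales = 10                                # ten Jesness scales were analyzed

# Share of the variance retained by the three factors.
share_of_retained = {f: ev / retained_total for f, ev in eigenvalues.items()}

# Share of the total variance of the ten standardized scales.
share_of_total = {f: ev / n_scales for f, ev in eigenvalues.items()}

for factor in eigenvalues:
    print(factor,
          f"retained: {share_of_retained[factor]:.4f}",
          f"total: {share_of_total[factor]:.4f}")
```

Under the second reading the three factors together account for 8.00 / 10, or 80 percent, of the total variance.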

Results suggest that the personality dimensions associated with 
black delinquents are an indication of social and emotional malad- 
justment. The factors of Self-Estrangement, Social Isolation, and Im- 
maturity may be dimensions of a larger more pervasive construct of 
alienation in black delinquents. The implication of the current study is 
that new research identify meaningful personality dimensions of black 
delinquents as well as those of white and Indian groups. 
THE DEVELOPMENT AND VALIDATION OF THE STUDENT
OPINION INVENTORY FACTOR SCALES¹


JAN PERNEY 
Boston College 


A description is given of the development of the Student Opinion 
Inventory (SOI), an instrument designed to measure attitudes of 
students in secondary schools toward several aspects of their schools. 
The purposes of this investigation were to determine the concurrent 
validity of the SOI factor scales and to examine the reproducibility of 
the reliability estimates found in pilot studies of the instrument. 
Responses to the SOI from 367 students indicated that 5 of the 6 
factor scales of the SOI possessed some concurrent validity. Further- 
more, the obtained reliabilities of the factor scales were relatively 
high for attitudinal measures and closely reproduced reliability esti- 
mates established by the final pilot study of the instrument. 


ONE aspect which was seldom included in school evaluations until
recently is the assessment of attitudes of students toward their schools.
One reason that students’ attitudes were not widely assessed may have
resulted from the absence of an appropriate instrument to measure
students’ attitudes toward the entire school program. Attitudinal
instruments developed prior to 1973 either gathered information on only
one or two dimensions (Finch, 1969; Educational Testing Service,
1972) or furnished data too global to be of diagnostic value (Remmers,
1952). To provide an instrument which measured attitudes of students
toward several aspects of their schools, the Student Opinion Inventory
(National Study of School Evaluation, 1974) was developed under the
auspices of the National Study of School Evaluation (NSSE). The
purposes of this current investigation were to determine the concurrent
validity of the Student Opinion Inventory (SOI) factor scales
and to examine the reproducibility of reliability estimates found in
pilot studies of the instrument.


¹ The author wishes to thank his doctoral committee and Dr. Peter Airasian for
commenting on previous drafts of this paper.


The Instrument 


The pilot SOI contained 48 5-point scale items written to corre- 
spond to sections in Evaluative Criteria (National Study of School 
Evaluation, 1970), a widely circulated publication of the NSSE used 
for school evaluation and accreditation purposes. Each of the six
groups of items which corresponded to a section in Evaluative Criteria 
was hypothesized to constitute a factor scale for the final instrument. 
The groups of items were written to assess attitudes of students toward 
(a) their teachers, (b) their counselors, (c) the school administration, 
(d) the curriculum and instruction of their school, (e) cocurricular 
activities, and (f) school characteristics in general.

To date one local and one national pilot study of the SOI have been 
conducted. For the local pilot study the SOI was administered to 
students in three urban midwest high schools in April, 1972. A princi- 
pal components factor analysis with orthogonal rotation of the factors 
was used to analyze 1164 student responses. Since there were 9 eigen- 
values greater than 1.0, 9 factors were rotated; however, only 6 factors 
were considered interpretable. The factor analysis indicated that the 
factors emerged essentially as predicted and that the coefficient alpha 
reliability estimates of the factor scales ranged from .66 to .86, with a 
median reliability of .80. Thus, the factor scales were judged to possess
adequate reliabilities to warrant further investigation. The pilot SOI 
was revised on the basis of the local pilot study data. 
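
Coefficient alpha, the reliability estimate used throughout the pilot work, can be illustrated with a short sketch. The item scores below are hypothetical 5-point responses, not SOI data, and the function name is an assumption introduced here for illustration.

```python
from statistics import pvariance

def coefficient_alpha(item_scores):
    """Coefficient alpha for a scale.

    item_scores: one list per item, each holding one score per respondent.
    alpha = k / (k - 1) * (1 - sum of item variances / variance of totals)
    """
    k = len(item_scores)
    totals = [sum(responses) for responses in zip(*item_scores)]
    sum_item_variances = sum(pvariance(item) for item in item_scores)
    return k / (k - 1) * (1 - sum_item_variances / pvariance(totals))

# Hypothetical responses of five students to a three-item scale.
items = [
    [1, 2, 3, 4, 5],
    [2, 2, 3, 5, 5],
    [1, 3, 3, 4, 4],
]
alpha = coefficient_alpha(items)
print(round(alpha, 2))
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as an index of internal consistency.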

For the second pilot study, a random sample was drawn from all 
accredited secondary schools in the continental United States. The 
SOI was administered to 1157 students from 43 schools in the sample 
during February, 1973. The same method of analysis that was used in 
the local pilot study was employed to examine the responses in the 
national pilot study. The results indicated a factor structure similar to 
that found in the pilot test analysis. Furthermore, the estimated relia- 
bilities of the factor scales remained substantial. Following are the 6 
factors which were obtained from the second pilot study. A description
of the factor, the number of items (n) per factor, and the reliability
estimate (r) for each factor are given:

Student-Teacher (ST). Perceptions held by students about their
teachers’ helpfulness in learning subjects; n = 7; r = .82.

Student-Counselor (SC). Students’ expressed feelings concerning
counselors’ helpfulness in vocational, academic and personal
matters; n = 5; r = .81.

Student-Administration (SA). Students’ attitudes toward the
treatment of students by the administration; n = 6; r = .76.

Student-Curriculum and Instruction (SI). Students’ perceptions
about the adequacy of the curriculum and quality of teaching;
n = 5; r = .75.

Student-Participation (SP). Students’ attitudes toward cocurricular
activities and participation in school life; n = 5; r = .69.

Student-School Image (SS). Satisfaction expressed by students with
school in general and pride in their school; n = 6; r = .78.

The final form of the SOI consisted of 34 items determined by the 
factor analysis of the responses from the national pilot study. More 
information concerning the background and use of the SOI is con- 
tained in the manual for the instrument. 


Methodology 


To investigate the concurrent validity of the first three SOI factor 
measures, semantic differential scales were constructed to reflect atti- 
tudes similar to those assessed by the ST, SC, and SA factor scales. 
For example, a semantic differential scale was constructed to measure 
how much students liked their counselors. The correlation between the 
semantic differential scale designed to portray student liking of coun- 
selor and the SC factor scale was expected to be moderate and posi- 
tive. To investigate the validity for the two additional SI and SP factor 
scales, semantic differential scales which were intended to measure the
same attitudes as those thought to be portrayed in the SI and SP factor 
scales were constructed. Since it was hypothesized that the same atti- 
tudes were being measured by both the SOI factor scales and semantic 
differential scales, it was expected that the validity coefficients would 
be positive and somewhat larger for the SI and SP scales than those for 
the ST, SC, and SA factor scales. Because of time constraints, no 
validity data were gathered for the SS factor scale. 


Results and Discussion 


The results of the current study were based on responses of 367
students from seven secondary schools located throughout the United
States. The coefficient alpha reliability estimates for the semantic
differential scales ranged from .70 to .87 with a median reliability of .84.
The correlations of the ST, SC, and SA factor scales with their semantic
differential counterparts were .38, .49, and .36, respectively, whereas
the reliabilities were .80, .81, and .75, respectively. Because in this case
the SOI factor scales and the semantic differential scales were hypothesized
to be measuring similar, but not the same, attitudes, the moderate
positive correlations were expected. The validity coefficients for the
SI and SP factor scales were .59 and .50 respectively, whereas the 
reliability estimates were .75 and .66, respectively. As expected, the 
correlations were positive and somewhat larger than were the correla- 
tions for the ST, SC and SA factor scales. Moreover, the reliability 
estimates for the factor scales in this current study closely approx- 
imated the reliability estimates determined in the national pilot admin- 
istration of the instrument. For example, the obtained reliability for 
the SA factor scale in this current study was .75, whereas for the 
national pilot study the obtained reliability was .76. 
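
The claim that the reliabilities were closely reproduced can be checked scale by scale from the figures given in the text. The sketch below only tabulates values already reported; the SS scale is omitted because no reliability from the current study is reported for it.

```python
# Coefficient alpha estimates reported for the national pilot study
# and for the current study (SS omitted: no current-study value reported).
pilot = {"ST": 0.82, "SC": 0.81, "SA": 0.76, "SI": 0.75, "SP": 0.69}
current = {"ST": 0.80, "SC": 0.81, "SA": 0.75, "SI": 0.75, "SP": 0.66}

# Change in reliability from the pilot to the current study.
differences = {scale: round(current[scale] - pilot[scale], 2) for scale in pilot}
largest_shift = max(abs(d) for d in differences.values())

for scale, diff in differences.items():
    print(f"{scale}: pilot {pilot[scale]:.2f}, "
          f"current {current[scale]:.2f}, change {diff:+.2f}")
print("largest absolute change:", largest_shift)
```

No scale moved by more than .03, and every change was either zero or a small decrease.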


Conclusions 


The following conclusions were formulated: 

1. The factor scales of the SOI appeared to be sufficiently reliable. 
Not only were the internal consistency reliability estimates relatively 
high for attitudinal measures, but also the reliability estimates found 
in this current study closely approximated those obtained in the na- 
tional pilot study. 

2. The factor analysis of the responses in the pilot tests provided 
validity evidence in two ways. First, the factors which emerged in the 
pilot test were predicted when the instrument was constructed. Sec- 
ond, similar factor structures were obtained in both the local and the 
national pilot tests. 

3. Further validity evidence was established by this current study,
since the correlations predicted to exist between the SOI factor scales 
and the semantic differential scales were obtained. 
THE RELATIONSHIP OF READING ACHIEVEMENT,
SCHOOL ATTITUDE, AND SELF-RESPONSIBILITY 
BEHAVIORS OF SIXTH-GRADE PUPILS TO COMPARATIVE 
AND INDIVIDUALIZED REPORTING SYSTEMS: 
IMPLICATIONS FOR IMPROVEMENT OF VALIDITY OF THE 
EVALUATION OF PUPIL PROGRESS 


THOMAS W. BUTTERWORTH 
Los Angeles County Schools 


WILLIAM B. MICHAEL 
University of Southern California 


Two comparable samples of sixth-grade pupils were recipients of
information furnished by two different systems of reporting pupil 
progress: (a) one involving use of a traditional competitive and 
comparative A-F letter grade approach and (b) the other embracing 
a highly individualized procedure which consisted of detailed narra- 
tive statements providing evaluative feedback on performance in the 
school setting. A 2 × 2 × 2 quasi-experimental design (reporting
system × IQ × sex) was employed with dependent variables including
measures of (a) reading achievement, (b) school attitude, and (c)
self-responsibility for intellectual attainments. Results from three 
univariate analyses of variance revealed significant main effects for 
each dependent variable favoring the individualized reporting system 
over the traditional one, high ability children over low ability chil- 
dren, and girls over boys. In the instance of the measure of in- 
tellectual self-responsibility a significant interaction occurred be- 
tween ability level and mode of reporting which suggested that as 
compared with the traditional competitive mode an individualized 
reporting system would yield differential outcomes indicating a
higher average level of intellectual self-responsibility for children of 
low ability but no appreciable difference in average level of self- 
responsibility for children of high ability. Implications of the use of 
individualized reporting systems for improving the validity of eval- 
uating pupil progress are discussed. 
FOR two samples of 300 sixth-grade pupils (600 children altogether)
from two different adjacent Southern California suburban commu- 
nities of a comparable socio-economic level and similar ethnic compo- 
sition the purpose of this investigation was to determine whether 
differences in reading achievement, attitude toward school, and self- 
responsibility (dependent variables) were related to (a) employment of 
either one of two systems of reporting pupil progress—one con- 
stituting a traditional comparative A to F letter grade approach and 
the other an individualized procedure involving an evaluative feed- 
back in the form of narrative statements without symbols (treatment 
factor), (b) high or low general ability (as determined respectively by 
placement at an IQ of 100 or above or falling at an IQ below 100 in 
terms of a deviation IQ based on the total raw score on the five-test 
Verbal Scale of the Lorge-Thorndike Intelligence Tests, 1964, Multi- 
Level Edition, Level D, Form 1), and (c) sex. Interest also centered on 
possible interactions between the treatment factor and either general 
ability level or sex with respect to either reading performance or 
affective dimensions of school attitude or self-responsibility. Because 
of the paucity of published research, the data from this study could 
possibly provide guidelines for developing report systems that would 
have augmented validity in the evaluation of pupil progress. 


Methodology 


In the data analysis a quasi-experimental 2 × 2 × 2 factorial design
(treatment × IQ × sex) (Kirk, 1968, pp. 221-227) was used to examine
main effects and interaction effects associated with posttreatment
scores on each of three dependent variables: (a) reading achievement
as indicated by standing on the Comprehensive Test of Basic Skills
(CTBS), Reading (Level 2, Form R), total score, (b) attitude toward
school as revealed by responses to a Semantic Differential (SD) scale
consisting of 12 school concepts each permuted with five bipolar
adjectives (evaluative in emphasis), and (c) self-responsibility as defined by
placement on the Intellectual Achievement Responsibility (IAR) measure
(Crandall, Katkovsky, and Crandall, 1965).

Pretreatment equivalence of the two samples studied was demonstrated
by a lack of significant differences in pretreatment mean scores
on each of the three dependent variables as well as in mean IQ scores 
which to the nearest integer were 101 and 102 for the two samples. 


Findings 


The major results of posttreatment testing, which are set forth in
Tables 1 and 2, may be summarized as follows: 


TABLE 1
Summary of Analyses of Variance for Reading Achievement Scores on the CTBS, Attitude
Scores on the SD, and Self-Responsibility Scores on the IAR


                               Reading Achievement    Attitude toward School    Self-Responsibility
Source of                            (CTBS)               (SD Measure)             (IAR Measure)
Variation               df        MS         F           MS          F            MS         F
Reporting System (RS)     1      15.55      7.44**    50912.85     21.92**      301.04     16.53**
Sex (S)                   1       8.21      3.93*     50178.59     21.60**      217.20     11.93**
Ability (A)               1    1134.91    543.02**    20638.90      8.88**      312.48     17.16**
RS × S                    1       0.04       -          539.60       -            3.08       -
RS × A                    1       1.23       -         2683.99      1.16        126.04      6.92**
S × A                     1       0.49       -         1016.66       -            2.54       -
RS × S × A                1       0.02       -         2472.56      1.06          1.98       -
Within                  592       2.09       -         2323.11       -           18.21       -

Subgroup                    CTBS Means    SD Meansᵃ    IAR Means
Comparative Reporting          6.82         173.24        24.71
Individualized Reporting       7.14         154.82        26.13
Boy                            6.86         173.17        24.82
Girl                           7.10         154.82        26.02
High Ability                   8.36         158.16        26.14
Low Ability                    5.61         169.89        24.70


* p < .05.
** p < .01.
ᵃ Lower number indicative of more positive attitude.
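
Each F ratio in Table 1 is the mean square for the effect divided by the within-cells mean square for the same dependent variable. The sketch below recomputes the significant F ratios from the tabled mean squares as a consistency check. The ability mean square for reading is taken as 1134.91, the reading that agrees with the reported F of 543.02; the dictionary layout and names are assumptions made for illustration.

```python
# Within-cells (error) mean squares, df = 592, for each dependent variable.
ms_within = {"CTBS": 2.09, "SD": 2323.11, "IAR": 18.21}

# Effect mean squares (df = 1 each) taken from Table 1.
ms_effect = {
    "Reporting System": {"CTBS": 15.55, "SD": 50912.85, "IAR": 301.04},
    "Sex": {"CTBS": 8.21, "SD": 50178.59, "IAR": 217.20},
    "Ability": {"CTBS": 1134.91, "SD": 20638.90, "IAR": 312.48},
    "RS x Ability": {"CTBS": 1.23, "SD": 2683.99, "IAR": 126.04},
}

# F = MS(effect) / MS(within), rounded to two places as in the table.
f_ratios = {
    source: {dv: round(ms / ms_within[dv], 2) for dv, ms in by_dv.items()}
    for source, by_dv in ms_effect.items()
}

for source, by_dv in f_ratios.items():
    print(source, by_dv)
```

Ratios below 1.00, such as the reporting system by ability interaction for reading, are the entries shown as dashes in the table.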


1. On each of the three dependent variables statistically significant 
main effects occurred relative to mode of reporting system employed,
ability level, and sex. In particular, the mean scores on the
reading and self-responsibility measures were observed to be
higher for the individual reporting system than for the competitive
A to F marking system, for girls than for boys, and for high ability
than for low ability children. On the SD measure of attitude toward
school the mean scores were lower (indicating a more favorable 
attitude) for the subgroup exposed to the individual reporting sys- 


TABLE 2
Mean Self-Responsibility Scores of the IAR in the Statistically Significant Reporting
System by Ability Interaction


                                     Ability               Difference between
                                                            A₁ and A₂ Means
Reporting System              High (A₁)    Low (A₂)    (Column 1 minus Column 2)
Comparative (RS₁)               25.89        23.53               2.36
Individualized (RS₂)            26.39        25.87               0.52
Differences between RS₂ and
RS₁ means
(Row 2 minus Row 1)              0.50         2.34




EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT


tem than for the subgroup exposed to the competitive reporting
system, for girls than for boys, and for the high ability level than for
the low ability level subgroups. In terms of two-tailed t-tests all
differences were significant beyond the .01 level except for the single
difference between girls and boys on the reading achievement variable,
which was significant at the .05 level (but still at the .01 level
for the directional null hypothesis that the mean of the population
of girls would be less than or equal to that of the boys).

2. The only statistically significant interaction effect (p < .01) was that
between treatment (type of reporting system) and intellectual ability
level for the dependent variable concerned with self-responsibility.
An inspection of the entries in Table 2 reveals that in the
instance of the comparative reporting system (RS₁) a difference of
2.36 occurred between the means of 25.89 and 23.53 for the high
ability (A₁) and low ability (A₂) subgroups, respectively, or that in
the case of the low ability (A₂) subgroup a difference of 2.34 occurred
between the means of 25.87 and 23.53 for those subgroups
exposed, respectively, to the individualized (RS₂) and comparative
(RS₁) reporting systems. (Although not reported, F ratios, each
associated with one and 592 degrees of freedom, of 22.95 for simple
effects between A₁ and A₂ at the RS₁ level and of 22.43 for simple
effects between RS₁ and RS₂ at the A₂ level were significant considerably
beyond the .01 level.)


Conclusions


The following conclusions may be formulated:

1. It would appear that the individualized reporting system as compared
with the competitive A to F reporting system was associated
with higher reading performance, a more favorable attitude toward
school, and a greater sense of self-responsibility.

2. Although pupils of relatively high ability differed little in their level
of self-responsibility irrespective of how their progress in school
was evaluated and reported, children of relatively low ability displayed
a higher level of self-responsibility when their work was
judged and reported in an individualized manner with narrative
statements embodying feedback than when their work was evaluated
and reported in a traditional competitive A to F symbolic
form.

3. Apparently attitude toward school as measured as well as
cognitive performance in reading were not dependent upon interaction
effects between ability level and mode of reporting pupil
progress.
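
The reporting system by ability interaction for self-responsibility can be retraced from the four cell means in Table 2. The sketch below computes the two ability gaps, their difference (the interaction contrast), and approximate F ratios for the two simple effects cited in the Findings. The figure of 150 pupils per cell is an assumption made here (600 pupils spread evenly over the four cells); the source does not report cell sizes, so the F values are only approximate checks.

```python
# IAR cell means from Table 2.
cell_means = {
    ("comparative", "high"): 25.89,
    ("comparative", "low"): 23.53,
    ("individualized", "high"): 26.39,
    ("individualized", "low"): 25.87,
}
MS_WITHIN = 18.21      # within-cells mean square for the IAR (Table 1)
N_PER_CELL = 150       # ASSUMPTION: equal cell sizes, not stated in the source

# Ability gap (high minus low) under each reporting system.
ability_gap = {
    system: cell_means[(system, "high")] - cell_means[(system, "low")]
    for system in ("comparative", "individualized")
}
# The interaction contrast is the difference between the two gaps.
interaction_contrast = ability_gap["comparative"] - ability_gap["individualized"]

def simple_effect_f(mean_a, mean_b):
    """F (df = 1) for the difference between two cell means of equal size."""
    ms_effect = N_PER_CELL * (mean_a - mean_b) ** 2 / 2
    return ms_effect / MS_WITHIN

f_ability_at_comparative = simple_effect_f(25.89, 23.53)
f_system_at_low_ability = simple_effect_f(25.87, 23.53)

print(ability_gap, round(interaction_contrast, 2))
print(round(f_ability_at_comparative, 2), round(f_system_at_low_ability, 2))
```

Under the equal-cell assumption the two simple-effect F ratios come out near the reported 22.95 and 22.43; the small discrepancies could reflect rounding of the means or unequal cell sizes.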
Discussion: Implications for Improving Validity of
the Evaluation of Pupil Progress


It would appear that the feedback provided by detailed narrative 
statements in reports of pupil attainments could have accounted in 
large part for the higher average standings in the reading measure. In 
turn it would not seem unexpected that as cognitive performances 
improve, both more favorable attitudes toward school and a height- 
ened sense of responsibility for one's own accomplishments would 
occur. In the instance of the interaction between ability level and mode 
of reporting pupil performance it appeared that pupils of lower ability 
levels as compared with those at higher ability levels might have been 
aided to a greater degree in their acquisition of a heightened level of 
self-responsibility. With greater lapse in time similar interactive effects 
might have taken place relative to reading achievement and attitude- 
toward-school variables. From the standpoint of improving the valid- 
ity of evaluation of pupil progress the individualized report system as 
compared with the traditional comparative and competitive A to F 
system would appear to have the following advantages: (a) specificity
of feedback regarding mastery of designated course objectives, (b) 
means for on-going formative evaluation permitting changes in cur- 
riculum and in instructional strategies in view of improved diagnosis 
of areas of strength and weakness, and (c) improved basis for commu- 
nicating to parents and pupils those cognitive and affective behaviors 
that might be modified to facilitate pupil growth and self-actualization.
Such facilitative effects associated with individualized reporting
systems may be relatively more important for children of lower levels
of ability than for those of higher levels of ability. 
INTERRELATIONSHIPS AMONG 76 INDIVIDUALLY-
ADMINISTERED TESTS INTENDED TO REPRESENT 76
DIFFERENT STRUCTURE-OF-INTELLECT ABILITIES AND A
For a sample of 34 nine-year-old children from a southern California
middle-class suburban community both the scores of 76 individually
administered structure-of-intellect (SOI) tests constructed or
selected to duplicate exactly 76 hypothesized SOI abilities and the
scores on the verbal (V), nonverbal (NV), and composite (C) scales
of the Lorge-Thorndike Intelligence Tests (LT), Multi-Level Edition,
were intercorrelated and factor analyzed to determine the extent of
overlap of SOI ability measures, their degree of relationship with the
LT scales, and the possible presence of second order factors among
the SOI tests. Although the data revealed a range in magnitude of the
2850 correlation coefficients among the SOI measures from -.47 to
.69 (median coefficient, .13), the values for the ranges and the average
magnitudes of intercorrelation coefficients of SOI tests within
single categories of the same operations, contents, or products dimension
of the SOI model did not differ appreciably from those corresponding
values and magnitudes found between categories from different
SOI dimensions. In the absence of identifiable meaningful
second order factors or dimensions, there was, however, the suggestion
of a factor of general intellectual function in view of the high
loadings of several SOI tests on the same factor as that on which the
LT-V and LT-NV scales were heavily saturated. A weighted combination
of eight to ten SOI ability tests could afford a potentially
valid representation of the complex of functions or of the general
function being measured by the LT-V and LT-NV scales.


¹ This report is based on research findings from a Title VI B project, Project Number
19 64576-4123-02, entitled “Learning How to Learn,” funded by the Department of
Health, Education, and Welfare, United States Office of Education.


FOR a sample of 34 children who resided in a middle-class suburban
community in southern California the purposes of this descriptive 
correlational investigation were (1) to ascertain the degree of inter- 
relationship among 76 individually-administered tests designed to du- 
plicate exactly 76 factors in the Structure-of-Intellect (SOI) model 
(Guilford, 1967), (2) to determine the extent of relationship of each of 
these SOI tests with the verbal, nonverbal, and composite scales of the 
1964 Lorge-Thorndike Intelligence Tests, Multi-Level Edition (LT-V, 
LT-NV, and LT-C), and (3) to obtain evidence regarding the nature of 
second-order factors, if any, that would describe the interrelationships 
among the measurable constructs in the SOI model. The research to be 
reported is an outgrowth of a school district project that was initiated 
to provide individualized instruction to children in developing in- 
creased competencies in their use of several SOI abilities. It was 
thought that the correlational data would furnish important informa- 
tion concerning the degree of overlap among SOI tests as well as an 
indication of the extent to which such tests are related to an intended 
measure of so-called general intelligence. In addition, evidence could 
also be obtained to determine whether certain categories within each 
of the three broad dimensions of operations, contents, and products in 
the SOI model tend to be highly interrelated and thus possibly in- 
dicative of higher order constructs of intellectual function. 


Methodology 
Sample 


The sample of 34 children, of whom there were 12 boys and 22 girls,
ranged in age from 9 years 0 months to 9 years 11 months. Only those
pupils were selected whose deviation IQ on the composite scale
of the LT fell between 85 and 115.


Tests 


To furnish a measure of general intellectual status the LT Intelligence
Tests, Multi-Level Edition, Levels A and B, Form 1 were
administered approximately 3 to 12 months prior to the time that the
children were given the SOI tests. Deviation IQ scores were determined
on the previously cited LT-V, LT-NV, and LT-C scales.

Through use of models of test items provided by Guilford (1967), 
Guilford and Hoepfner (1971), Meeker (1969), and reports from the 
Aptitudes Research Project at the University of Southern California, a 
committee of teachers, teacher aides, and school psychologists, who 
had had special training in a workshop during the previous summer, 
constructed and in a few instances selected tests to represent the 76 
SOI abilities. These tests were carefully reviewed and edited and in 
many instances subsequently revised by psychologists associated with 
the school project. 

To portray the five operations (inferred psychological processes) of
cognition (C), memory (M), evaluation (E), convergent production
(N), and divergent production (D), 16, 14, 16, 15, and 15 tests,
respectively, were devised or selected. In the contents dimension, the
numbers of these same tests to represent the figural (F), symbolic (S), and
semantic (M) categories of given information to be processed were,
respectively, 25, 23, and 28. For the six products (new information
arising from processing of given information) of units (U), classes (C),
relations (R), systems (S), transformations (T), and implications (I),
the corresponding frequencies of tests employed were 13, 15, 14, 10,
15, and 9.
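
The category sizes given above fix how many intercorrelations fall within each category and between each pairing of categories, which are the parenthesized counts in Tables 2 through 4. A short sketch of the counting follows. The transformations count is taken as 15, the value for which the six product frequencies sum to 76; the dictionary names are assumptions made for illustration.

```python
from itertools import combinations
from math import comb

# Number of tests per category, as given in the text.
operations = {"C": 16, "M": 14, "E": 16, "N": 15, "D": 15}
contents = {"F": 25, "S": 23, "M": 28}
products = {"U": 13, "C": 15, "R": 14, "S": 10, "T": 15, "I": 9}

def within_count(n_tests):
    """Distinct pairs of tests inside one category."""
    return comb(n_tests, 2)

def between_count(n_first, n_second):
    """Pairs formed by one test from each of two categories."""
    return n_first * n_second

def all_counts(dimension):
    """Within-category and between-category counts for one SOI dimension."""
    within = {name: within_count(n) for name, n in dimension.items()}
    between = {(a, b): between_count(dimension[a], dimension[b])
               for a, b in combinations(dimension, 2)}
    return within, between

total_pairs = comb(76, 2)  # all intercorrelations among the 76 tests
ops_within, ops_between = all_counts(operations)
print(total_pairs, ops_within, ops_between)
```

For every dimension the within-category and between-category counts together exhaust the full set of intercorrelations among the 76 tests.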

In Table 1 each of the 76 SOI tests is designated in terms of the 
trigram notation employed by Guilford (1967) to describe an hypoth- 
esized ability factor. The first letter refers to the operations category; 
the second, to the contents category; and the third, to the products 
category. One can interpret a trigram as revealing what operation is
being used to process given information (content) to bring about new 
information (product)—that is, operating on content to obtain a prod- 
uct. For example, the trigram CFS, which is cognition of figural 
systems, indicates that an examinee would use the operation of cogni- 
tion to process figural content to obtain a product in the form of a 
system.

Because of the exploratory nature of this investigation and because
of the somewhat limited size of the sample, it did not appear feasible to
undertake the almost prohibitive amount of expense and time required
to obtain estimates of the reliability of the 76 SOI tests. In a few
instances some commercially available tests were employed to duplicate
a given SOI factor. An effort was made to approximate tests that
had been used in the Aptitudes Research Project of the University of
Southern California, but to adapt the difficulty level. Thus the estimates
of reliability provided by Guilford and Hoepfner (1971) would
not be entirely appropriate. Communality estimates arising from the
factor analysis of the correlational data as well as the highest correlation
which any test exhibited with another test in the correlation
matrix could probably be taken as lower bound approximations to
reliability. In general, it appeared that virtually every SOI test would
yield an estimate of reliability in excess of .40 and usually an estimate
greater than .50.


Statistical Analysis 


Pearson product-moment coefficients of correlation were found be- 
tween sets of scores on all SOI tests and on the LT-V, LT-NV, and LT- 
C scales. These correlational data are summarized descriptively in 
Tables 1 through 5. The only inferential interpretation that could be 
rendered would be in relation to the correlation coefficients presented 
in Table 1. For a sample of 34 subjects, coefficients of .34 and .44 are 
required for significance at the .05 and .01 levels, respectively.
Significance tests of median values of a distribution of correlated or
dependent correlation coefficients were not known to the writers. Hence, the
correlational data in Tables 2, 3, 4, and 5 could be treated only 
descriptively. 
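
The critical values of .34 and .44 cited above follow from the usual t test of a correlation coefficient with n - 2 degrees of freedom. The sketch below converts a critical t into a critical r. The two t values are standard two-tailed table entries for 32 degrees of freedom and are supplied here as constants, not computed; the function name is an assumption made for illustration.

```python
import math

def critical_r(t_critical, n):
    """Smallest |r| significant for sample size n, given the critical t.

    Inverts t = r * sqrt(n - 2) / sqrt(1 - r**2) to solve for r.
    """
    df = n - 2
    return t_critical / math.sqrt(t_critical ** 2 + df)

# ASSUMPTION: two-tailed critical t values for df = 32 from standard tables.
T_CRITICAL_DF32 = {0.05: 2.0369, 0.01: 2.7385}

n_children = 34
r_05 = critical_r(T_CRITICAL_DF32[0.05], n_children)
r_01 = critical_r(T_CRITICAL_DF32[0.01], n_children)
print(round(r_05, 2), round(r_01, 2))
```

With so many coefficients tested on only 34 children, some values above these thresholds would be expected by chance alone, which is consistent with the authors' largely descriptive treatment of the data.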

A principal components factor analysis followed by rotation to 
satisfy the Varimax criterion was completed. Because of the presence 
of more test variables than of subjects, unities were inserted in the 
diagonals of the correlation matrix, although an analysis based upon 
the insertion of the highest column correlation coefficient for the 
diagonal entry would not be expected to alter the basic factor structure 
to a substantial degree. Only those rotated factors which had at least
three variables with loadings equal to or greater than .35 were examined
for possible interpretation.
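
The screening rule for rotated factors can be expressed as a short filter. The loading matrix below is hypothetical, and comparing the absolute value of each loading with .35 is an assumption made here; the source says only "equal to or greater than .35."

```python
def interpretable_factors(loadings, cutoff=0.35, min_variables=3):
    """Indices of factors with at least `min_variables` salient loadings.

    loadings: one row per test, one column per rotated factor.
    A loading counts as salient when its absolute value reaches `cutoff`.
    """
    n_factors = len(loadings[0])
    selected = []
    for j in range(n_factors):
        salient = sum(1 for row in loadings if abs(row[j]) >= cutoff)
        if salient >= min_variables:
            selected.append(j)
    return selected

# Hypothetical rotated loadings for five tests on three factors.
example_loadings = [
    [0.62, 0.10, 0.05],
    [0.48, 0.36, -0.12],
    [0.35, 0.41, 0.08],
    [-0.40, 0.02, 0.55],
    [0.12, 0.20, 0.39],
]
print(interpretable_factors(example_loadings))
```

In this example only the first factor has three or more salient loadings, so it alone would be examined for interpretation.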


Findings 


In relation to the first objective of the investigation concerning the
intercorrelations among the 76 SOI tests, the following outcomes may
be summarized:

1. The range in values of the 2850 correlation coefficients was from
-.47 to .69, and the median coefficient was equal to .13.

2. For the dimension of operations, the data for which are summarized
in Table 2, the median coefficients within the five categories
and between the 10 pairings of categories were, respectively, .18
and .11. Within categories, the highest median coefficient of .28
was for tests of cognition, and the lowest median coefficient of
TABLE 2
Permutations of Five SOI Operations Categories: Number of Tests in Each Category, Number of Test
Intercorrelations within Each Category (Diagonal Entries) or between Each Pairing of Categories (Off
Diagonal Entries), Median Values of These Intercorrelations, and Their Ranges*

Categories                          (1)            (2)            (3)            (4)            (5)
and                              Cognition       Memory       Evaluation     Convergent     Divergent
Numbers of                                                                   Production     Production
Tests                               (C)            (M)            (E)            (N)            (D)
                                 16 Tests       14 Tests       16 Tests       15 Tests       15 Tests

(1) Cognition (C)                  (120)          (224)          (256)          (240)          (240)
    16 Tests                        .28            .17            .20            .25            .11
                                -.24 to .67    -.32 to .57    -.30 to .63    -.35 to .68    -.44 to .65

(2) Memory (M)                     (224)           (91)          (224)          (210)          (210)
    14 Tests                        .17            .11            .08            .14            .01
                                -.32 to .57    -.28 to .47    -.46 to .52    -.46 to .60    -.39 to .48

(3) Evaluation (E)                 (256)          (224)          (120)          (240)          (240)
    16 Tests                        .20            .08            .18            .17            .11
                                -.30 to .63    -.46 to .52    -.26 to .52    -.29 to .60    -.38 to .56

(4) Convergent Production (N)      (240)          (210)          (240)          (105)          (225)
    15 Tests                        .25            .14            .17            .18            .08
                                -.35 to .68    -.46 to .60    -.29 to .60    -.26 to .69    -.41 to .51

(5) Divergent Production (D)       (240)          (210)          (240)          (225)          (105)
    15 Tests                        .11            .01            .11            .08            .16
                                -.44 to .65    -.39 to .48    -.38 to .56    -.41 to .51    -.09 to .58


* In this table and in the two tables to follow, the first row of each cell indicates the number of
intercorrelations of tests (entries in parentheses); the second row, the median value of the
intercorrelations; and the third row, the range in values of the intercorrelations.


.11 was for tests of memory. Between pairings of categories the
two highest median coefficients of .25 and .20 were for the respec- 
tive permutations of cognition and convergent production and of 
cognition and evaluation; the lowest median coefficient of .01 
was associated with the pairing of divergent production with 
memory. 

3. For the dimension of contents, the data for which are set forth in 
Table 3, the median coefficients within the three categories and 
between the three pairings of categories were, respectively, .15 
and .15. Within categories, the greatest median coefficient of .20
was for tests involving symbolic content, and the lowest median
coefficient of .13 was for measures of figural stimuli. Between
pairings of categories, the highest median coefficient of .16 was
for the permutation of figural material with symbolic content; 
the lowest median coefficient of .12 arose in conjunction with the 
permutation of figural and semantic items. 

4. For the dimension of products, the data for which are presented
in Table 4, the median coefficients within the six categories and
between the 15 pairings of categories were, respectively, .12 and
.14. Within categories, the highest median coefficient of .28 was
TABLE 3
Permutations of Three of the Four SOI Contents Categories (Behavioral Omitted),
Number of Tests in Each Category, Number of Test Intercorrelations within
Each Category (Diagonal Entries) or between Each Pairing of Categories
(Off Diagonal Entries), Median Values of These Intercorrelations
and Their Ranges

Categories and              (1)            (2)            (3)
Numbers of Tests          Figural       Symbolic       Semantic
                            (F)            (S)            (M)
                         25 Tests       23 Tests       28 Tests

(1) Figural (F)            (300)          (575)          (700)
    25 Tests                .13            .16            .12
                        -.31 to .58    -.47 to .69    -.46 to .60

(2) Symbolic (S)           (575)          (253)          (644)
    23 Tests                .16            .20            .15
                        -.47 to .69    -.26 to .64    -.42 to .68

(3) Semantic (M)           (700)          (644)          (378)
    28 Tests                .12            .15            .15
                        -.46 to .60    -.42 to .68    -.46 to .60

TABLE 4
Permutations of Six SOI Products Categories; Number of Tests in Each Category, Number of Test
Intercorrelations within Each Category (Diagonal Entries) or between Each Pairing
of Categories (Off Diagonal Entries), Median Values of Their Intercorrelations
and Their Ranges

Categories and              (1)            (2)            (3)            (4)            (5)            (6)
Numbers of Tests           Units         Classes       Relations       Systems      Transfor-      Implica-
                                                                                     mations         tions
                            (U)            (C)            (R)            (S)            (T)            (I)
                         13 Tests       15 Tests       14 Tests       10 Tests       15 Tests        9 Tests

(1) Units (U)              (78)           (195)          (182)          (130)          (195)          (117)
    13 Tests                ...            ...            ...            ...            .07            ...
                        -.46 to .47    -.31 to .56    -.32 to .52    -.35 to .58    -.47 to .54    -.46 to .58

(2) Classes (C)            (195)          (105)          (210)          (150)          (225)          (135)
    15 Tests                ...            ...            .22            ...            ...            .07
                        -.31 to .56    -.28 to .55    -.35 to .65    -.33 to .64    -.39 to .56    -.31 to .49

(3) Relations (R)          (182)          (210)          (91)           (140)          (210)          (126)
    14 Tests                ...            .22            .28            .19            .22            ...
                        -.32 to .52    -.35 to .65    -.04 to .56    -.21 to .69    -.22 to .68    -.30 to .50

(4) Systems (S)            (130)          (150)          (140)          (45)           (150)          (90)
    10 Tests                ...            ...            .19            .13            .14            ...
                        -.35 to .58    -.33 to .64    -.21 to .69    -.18 to .55    -.31 to .56    -.29 to .52

(5) Transformations (T)    (195)          (225)          (210)          (150)          (105)          (135)
    15 Tests                .07            ...            .22            .14            .12            ...
                        -.47 to .54    -.39 to .56    -.22 to .68    -.31 to .56    -.39 to .38    -.44 to .49

(6) Implications (I)       (117)          (135)          (126)          (90)           (135)          (36)
    9 Tests                 ...            .07            ...            ...            ...            .08
                        -.46 to .58    -.31 to .49    -.30 to .50    -.29 to .52    -.44 to .49    -.32 to ...

for tests involving relations, and the lowest median coefficient of
.08 was for tests representing implications. Between pairings of 
categories the three highest median coefficients of .22, .22, and 
.19 were for the respective pairings of classes with relations, 
relations with transformations, and relations with systems; the 
two lowest median coefficients (both .07) were associated with 
the pairings of units with transformations and of classes with 
implications. 
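
The numbers of intercorrelations shown in parentheses in Tables 2 through 4 follow from the numbers of tests per category: a category with n tests yields n(n - 1)/2 within-category coefficients, and two categories with n and m tests yield n x m between-category coefficients. The Python sketch below illustrates that bookkeeping only; the function names and the dictionary layout are assumptions made for this example and are not part of the original analysis.

```python
from math import comb

def pair_counts(sizes):
    """Return (within, between) counts of test pairs for one SOI dimension.

    sizes maps a category label to its number of tests.
    """
    labels = list(sizes)
    # Pairs of tests drawn from the same category: n choose 2.
    within = {k: comb(n, 2) for k, n in sizes.items()}
    # Pairs formed by one test from each of two different categories: n * m.
    between = {
        (a, b): sizes[a] * sizes[b]
        for i, a in enumerate(labels)
        for b in labels[i + 1:]
    }
    return within, between

def total_pairs(sizes):
    """Total number of distinct test pairs implied by the category sizes."""
    within, between = pair_counts(sizes)
    return sum(within.values()) + sum(between.values())

# Numbers of tests per category as given in the headings of Tables 2, 3, and 4.
operations = {"C": 16, "M": 14, "E": 16, "N": 15, "D": 15}
contents = {"F": 25, "S": 23, "M": 28}
products = {"U": 13, "C": 15, "R": 14, "S": 10, "T": 15, "I": 9}

for name, sizes in [("operations", operations),
                    ("contents", contents),
                    ("products", products)]:
    print(name, total_pairs(sizes))
```

Whichever dimension is used to partition the 76 tests, the within-category and between-category counts together account for all 2850 distinct pairs.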

Relative to the second objective of the study which was concerned 
with the degree of relationship of each of the SOI tests with the V, NV, 
and C scales of the LT, complete data are provided in Table 1, and 
summary information of the frequency distributions of the correlation 
coefficients between deviation IQ scores on the LT-C scale and scores 
on the SOI tests for each of the categories from the operations, con- 
tents, and products dimensions is set forth in Table 5. In view of the 
presence of a correlation of .94 between the LT-V and LT-NV scales, 
only data pertaining to the correlation of SOI tests with the LT-C scale 
are described in Table 5. The following results may be summarized: 

1. From the entries in Table 1, it is apparent that the three SOI tests
with the highest coefficients of correlation with the composite


TABLE 5
Frequency Distributions of Correlation Coefficients between IQ Scores on the Composite Scale of the
Lorge-Thorndike Intelligence Tests, Multi-Level Edition, and Scores on SOI Tests Designed to
Represent Each of the Categories within the Operations, Contents, and Products Dimensions

Class intervals for correlation coefficients (r): .70 to .79, .60 to .69, .50 to .59, .40 to .49,
.30 to .39, .20 to .29, .10 to .19, .00 to .09, -.00 to -.09, -.10 to -.19
Categories within dimensions: Operations (C, M, E, N, D); Contents (F, S, M); Products (U, C, R, S, T, I)

scale of the LT were EFR (.72), NSS (.62), and ESR (.60). The
SOI test with the lowest coefficient was DMU with a value of 
ust 

2. From the information cited in Table 5 it is evident that the 
highest coefficients appeared between the LT-C scale and those 
SOI tests reflecting the two operations of cognition and eval- 
uation, content designated as symbolic, and the products of rela- 
tions, classes, systems, and transformations. Correspondingly, 
the lowest levels of association occurred for the SOI tests in- 
volving the operation of divergent production, semantic content, 
and products of implications and units. 

For the third objective of this study which was directed toward 
identifying possible second order factors (dimensions) underlying the 
intercorrelations of the 76 SOI tests, no tabular data are presented in 
view of the limitations in space and in light of the absence of readily 
interpretable results. The following outcomes are summarized: 

1. Although 22 rotated factors with three or more loadings equal to 
or exceeding .35 emerged, the psychological meaningfulness of 
these factors was difficult to establish, as no definitive, clear-cut 
cluster or pattern of SOI tests representing common hypoth- 
esized characteristics appeared without the presence of another 
cluster or pattern affording a contradictory or alternative
interpretation.

2. Perhaps the only factor that could be interpreted, at least
tentatively, was the first one, which yielded loadings on 22 SOI tests
greater than or equal to .35 and which exhibited weights on 8
SOI tests of .50 or higher as evidenced by these SOI tests and
their corresponding factor saturations: NSS, .73; EFR, .68;
MER, .65; ESR, .61; NFR, .55; CFC, .54; EMC, .53; and CFT,
.50. Thus it would appear that a general intellectual factor was
present, which embraced mostly operations of convergent
production, evaluation, and cognition upon primarily figural and
symbolic content resulting in new information or products
largely in the form of classes, relations, and systems. The general
factor interpretation was further supported by the presence of
loadings of .94 on both the LT-V and LT-NV scales.


Conclusions 


Although there were marked fluctuations in the intercorrelations 
among the 76 SOI tests which could be attributed in part to the small 
sample size and to the limited degree of reliability of several of the 
shorter SOI tests, the following conclusions seem to be justified: 

1. Some indication of the stability in the measured representation of 
a given SOI category within any one of the three dimensions of
operations, contents, or products was evident from the small but 
positive value for the median correlation coefficient between SOI 
tests within that category. 

2. Furthermore, the size of the median coefficient between SOI tests 
from different categories within the same dimension suggested 
that positive though quite modest relationships existed between 
different categories in each of the three dimensions of operations, 
contents, and products. 

3. In general, the average magnitudes of the interrelationships of 
SOI tests from the same category within a given dimension did 
not differ appreciably from those of the interrelationships of SOI 
tests from different categories within the same dimension. Thus 
for any one of the three SOI dimensions, tests designed to meas- 
ure the same characteristic in that dimension exhibited inter- 
correlations of about the same magnitude as those for tests de- 
vised to measure different characteristics in that same dimension. 

4. That several of the SOI tests intended to represent quite different 
abilities showed substantial correlations with the LT scales as 
well as large loadings on the same factor or dimension as that on 
which the two LT subscales were heavily weighted suggests either 
that the LT is a highly complex instrument factorially, or that a 
general factor of intellectual function or of test-taking strategies 
might indeed exist. Additional efforts involving variations in 
factor analytic procedures might furnish evidence of a mean- 
ingful structure of second order factors which this factor analysis 
failed to reveal. (Incidentally, it should be mentioned that for the 
five subtests of vocabulary, sentence completion, arithmetic rea- 
soning, verbal classification, and verbal analogies in the LT-V 
scale and for the three subtests of figure analysis, number series, 
and figure classification in the LT-NV scale the writers hypoth- 
esized that the respective factors would be CMU, EMS or CMS 
(depending on the maturity of the child and the level of item 
difficulties), CMS, CMC, CMR or EMR (depending on the ma- 
turity of the child and the level of item difficulties), CFR, CSS, 
and CFC. These hypothesized factors did not correspond too 
closely to those for the eight SOI tests that yielded the previously 
enumerated loadings in excess of .50 on the first factor on which 
both the LT-V and LT-NV scales were weighted .94.)
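
The comparison drawn in the third conclusion, median correlation within a category against median correlation between categories, can be sketched in a few lines of Python. This is an illustration under stated assumptions: the four test names and six correlation values are invented for the example and are not SOI data, and `within_between_medians` is a hypothetical helper rather than the procedure used by the authors.

```python
from itertools import combinations
from statistics import median

def within_between_medians(corr, category):
    """Median correlation for same-category pairs and for cross-category pairs.

    corr maps frozenset({test_a, test_b}) to a correlation coefficient;
    category maps each test name to its category label.
    """
    within, between = [], []
    for a, b in combinations(sorted(category), 2):
        r = corr[frozenset((a, b))]
        if category[a] == category[b]:
            within.append(r)
        else:
            between.append(r)
    return median(within), median(between)

# Invented toy data: two tests in each of two categories.
category = {"T1": "C", "T2": "C", "T3": "M", "T4": "M"}
corr = {
    frozenset(("T1", "T2")): 0.30,  # within C
    frozenset(("T3", "T4")): 0.10,  # within M
    frozenset(("T1", "T3")): 0.15,
    frozenset(("T1", "T4")): 0.05,
    frozenset(("T2", "T3")): 0.20,
    frozenset(("T2", "T4")): 0.10,
}

within_median, between_median = within_between_medians(corr, category)
print(within_median, between_median)
```

When the two medians are of similar size, as reported for all three SOI dimensions, membership in the same category adds little to the expected correlation between two tests.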


Recommendations


From a developmental point of view this study needs to be replicated
with a larger group of children than was possible in this
investigation and at different age levels, as the pattern of
intercorrelations and the factor structure of measures of SOI abilities are
probably related to the maturity levels and experiential backgrounds 
of examinees. In particular, a combined longitudinal-experimental 
investigation is recommended in which one of two large comparable 
groups of children does receive formal training in the development of 
SOI abilities and the second group does not. Application of parallel 
instrumentation at different time points would afford an indication of 
how interrelationships among measures of SOI abilities change as a 
function both of chronological age and of the amount of formal 
exposure to learning experiences intended to enhance the acquisition 
of these abilities. 
THE RELATIONSHIP OF ACHIEVEMENT ON A TEACHER-
MADE MATHEMATICS TEST OF COMPUTATIONAL SKILLS 
TO TWO WAYS OF RECORDING ANSWERS AND TO TWO 
WORKSPACE ARRANGEMENTS 


GENE W. MAJORS 
Anaheim Union High School District 


JOAN J. MICHAEL 
California State University, Long Beach 


The performance of one sample of 120 seventh-grade students and
of one sample of 120 eighth-grade students on a 30-item
teacher-made test of computational skills was related to (a) transcribing the
item responses to numbered spaces in an answer column on a
separate sheet vs. writing the item response on the test form itself and (b)
providing workspace on the test form itself vs. not providing such
space. From the use of a 2 x 2 quasi-experimental design,
statistically significant differences were obtained with respect to both main
effects (p < .05) for the sample of seventh-grade students and with
respect to the single main effect of mode of recording item response
(p < .001) for the sample of eighth-grade students. When the data
were interpreted descriptively rather than inferentially there was the
strong suggestion that recording answers directly on the examination
form was associated with a higher average level of student
performance than that realized when a detached answer column was used.
For the sample of seventh-grade students, provision of working
space on the test form was observed to yield higher average scores
than when such a provision was not made.


For one sample of 120 seventh-grade students in four separate
mathematics classes and for a second sample of 120 eighth-grade
students also in four separate mathematics classes in a junior high
school located in a mixed socioeconomic community in
Orange County, California, the purpose of the investigation was to
determine whether average level of achievement in a 30-item teacher-made
test of arithmetic computation was related to (a) the presence of a
detached answer column appearing on a separate sheet of paper and
containing numbered spaces in which the student would insert his
answer or the absence of such an answer column with the result that
the student would write his solution directly on the test paper itself and
(b) the presence of workspace on the test paper itself or the absence of
such workspace requiring the student to do his computations on a
separate sheet of scratch paper. Although some research has been
done with primary and elementary school children regarding use of
answer sheets (Cashen, 1969; Gaffney, 1971; McKee, 1967), with
slow learners (Clark, 1968), and with culturally disadvantaged pupils
(Soloman, 1971), virtually no studies have been reported on use of
answer sheet formats with junior high school children (Miller and
Minor, 1963). No research was known to the writers concerning the
possible relationship of test performance to the availability or lack of
availability of workspace on the test sheet itself. Thus it appeared that
important implications for the validity of testing procedures
underlying arithmetic computational tasks might be forthcoming from an
investigation of four formats of testing involving permutations of two
modes of recording answers and two types of spatial provisions for
working problems.


Methodology 


At each of the two grade levels a quasi-experimental 2 x 2 factorial
design was employed involving use of four intact classes (30 students
per class), each of which was randomly assigned to one of the four
treatments (formats). Pretest data provided by the Comprehensive
Test of Basic Skills (CTBS), Mathematics Computation, Level 3,
Form Q, revealed no statistically significant differences among the
means of the four classes of seventh-grade pupils, F (3, 116) = 1.49, p
> .05, or among the means of the four classes of eighth-grade pupils, F
(3, 116) = 1.01, p > .05. Thus, in terms of CTBS scores no systematic
differences existed in the average level of computational skills of the
four classes within each grade level. Hence the groups were considered
comparable.


For the 30-item teacher-made test in arithmetic, which consisted of
addition, subtraction, multiplication, and division problems involving
integers and fractions and which was timed at 40 minutes, the four
formats representing the treatments were as follows: (1) workspace
provided on the test paper (form) and a detached answer column on a
separate sheet for reporting answers (WS-AC), (2) workspace
provided on the test paper and no detached answer column resulting in the
examinee's having to write answers on the test paper (WS-NAC), (3)
no workspace provided on the test paper (though scratch paper was
furnished) and a detached answer column on a separate sheet for
reporting answers (NWS-AC), and (4) no workspace provided on test
paper (though scratch paper was furnished) and no detached answer
column resulting in the examinee's having to write answers on the test
paper (NWS-NAC). This design permitted a two-way analysis of
variance of the resulting scores at each grade level.


Findings 


At the seventh-grade level, the means and standard deviations of the
scores of the four subsamples exposed to the four treatments WS-AC,
WS-NAC, NWS-AC, and NWS-NAC were, respectively,
24.20 and 4.74, 26.13 and 3.81, 22.80 and 3.92, and 24.40 and 4.15; at
the eighth-grade level, the corresponding means and standard
deviations were 21.07 and 4.10, 24.43 and 4.01, 19.73 and 3.81, and 23.90
and 3.89. The analyses of variance of the scores from which these
statistics were derived are summarized in Table 1. It is evident that for
the seventh-grade sample both main effects were statistically
significant beyond the .05 level (but not at the .01 level) but that for the
eighth-grade sample only the main effect pertaining to the presence or
absence of a detached answer column was statistically significant
(actually beyond the .001 level). No statistically reliable interaction
effects between treatment factors occurred for either sample of 120
students.


TABLE 1
Two-Way Analyses of Variance of Scores on the Teacher-Made Mathematics Test
of Computational Skills for Seventh- and Eighth-Grade Samples

Source of Variation      Sum of Squares     df     Mean Square       F

Seventh Grade Sample
Workspace (WS)                  73.63        1         73.63       4.23*
Answer Column (AC)              93.63        1         93.63       5.38*
Interaction                      0.83        1          0.83       0.05
Within Samples                2018.27      116         17.40         -
Total                         2186.36      119        185.49

Eighth Grade Sample
Workspace (WS)                  26.13        1         26.13       1.67
Answer Column (AC)             425.63        1        425.63      27.22**
Interaction                      4.80        1          4.80       0.31
Within Samples                1813.80      116         15.64         -
Total                         2270.36      119        472.20

* Significant at or beyond the .05 level.
** Significant at or beyond the .001 level.
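
The seventh-grade entries in Table 1 can be approximately recovered from the cell means and standard deviations reported above. The Python sketch below is offered as a check on the arithmetic of a balanced 2 x 2 design, not as the authors' computing procedure. It assumes that the reported standard deviations were computed with n - 1 in the denominator, and the function name `two_way_from_cells` and the dictionary keys are invented for the example.

```python
def two_way_from_cells(means, sds, n):
    """Two-way ANOVA summary for a balanced design from cell statistics.

    means and sds map (row_level, column_level) to a cell mean or SD;
    n is the common number of observations per cell.
    """
    rows = sorted({key[0] for key in means})
    cols = sorted({key[1] for key in means})
    grand = sum(means.values()) / len(means)
    row_mean = {r: sum(means[(r, c)] for c in cols) / len(cols) for r in rows}
    col_mean = {c: sum(means[(r, c)] for r in rows) / len(rows) for c in cols}

    # Main-effect sums of squares: weighted squared deviations of marginal means.
    ss_rows = n * len(cols) * sum((row_mean[r] - grand) ** 2 for r in rows)
    ss_cols = n * len(rows) * sum((col_mean[c] - grand) ** 2 for c in cols)
    # Interaction: what remains of each cell mean after removing both margins.
    ss_inter = n * sum(
        (means[(r, c)] - row_mean[r] - col_mean[c] + grand) ** 2
        for r in rows for c in cols
    )
    # Within-cells sum of squares, assuming SDs use an n - 1 denominator.
    ss_within = (n - 1) * sum(sd ** 2 for sd in sds.values())
    ms_within = ss_within / (len(means) * (n - 1))

    df_rows, df_cols = len(rows) - 1, len(cols) - 1
    return {
        "SS_rows": ss_rows,
        "SS_cols": ss_cols,
        "SS_interaction": ss_inter,
        "SS_within": ss_within,
        "F_rows": (ss_rows / df_rows) / ms_within,
        "F_cols": (ss_cols / df_cols) / ms_within,
        "F_interaction": (ss_inter / (df_rows * df_cols)) / ms_within,
    }

# Seventh-grade cell statistics as reported in the text (30 students per class).
# Rows are workspace conditions; columns are answer-recording conditions.
grade7_means = {("WS", "AC"): 24.20, ("WS", "NAC"): 26.13,
                ("NWS", "AC"): 22.80, ("NWS", "NAC"): 24.40}
grade7_sds = {("WS", "AC"): 4.74, ("WS", "NAC"): 3.81,
              ("NWS", "AC"): 3.92, ("NWS", "NAC"): 4.15}

grade7 = two_way_from_cells(grade7_means, grade7_sds, n=30)
for name, value in grade7.items():
    print(f"{name}: {value:.2f}")
```

The recomputed F ratios for workspace and answer column fall close to the tabled values of 4.23 and 5.38; small discrepancies are to be expected because the published means and standard deviations are rounded to two decimal places.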

For the seventh-grade sample, the means and standard deviations of
the two subgroups of 60 subjects allowed or not allowed workspace
(WS vs. NWS) irrespective of answer column availability were,
respectively, 25.17 and 4.38 and 23.60 and 4.08. Corresponding statistics for
the two subgroups at the eighth-grade level were 22.75 and 4.36 and
21.82 and 4.36. Relative to the presence or absence of a detached
answer column (AC vs. NAC), irrespective of the presence or absence
of workspace provided, the means and standard deviations at the
seventh-grade level were, respectively, 23.50 and 4.37 and 25.27 and 4.37;
similarly, at the eighth-grade level, the means and standard deviations
were, respectively, 20.40 and 3.98 and 24.17 and 3.92.

In the absence of any theoretical orientation that would permit
directional predictions of outcomes associated with the treatments in
this experiment, significance tests were nondirectional. Thus at a
descriptive level but not at an inferential level it is apparent that for the
seventh-grade sample but to a much lesser degree for the eighth-grade
sample, the mean performance was higher for the group of students
provided with workspace on the test paper than for that group not
given such a space allowance. Again at a descriptive level, it is evident
that for the group of seventh-grade students and particularly for the
group of eighth-grade students not given the detached answer column
(on a separate sheet) the means were higher than those means for the
groups of students required to record their responses in a detached
answer column.


Discussion 


The results of this study revealed significant differences in average
levels of performance on a test of arithmetic computation for both seventh- and
eighth-grade students depending upon whether a detached answer
column or the test paper itself was employed for recording
item responses. For seventh-grade but not eighth-grade students a
significant difference between means occurred in relation to whether
workspace was or was not provided on the test form. Interpretation of
the findings at a descriptive rather than an inferential level suggests
that requiring students to insert the numerical entries obtained for
their problem solutions on a separate answer sheet rather than having
them write these numerical answers on the test paper itself could result
in lower levels of test performance. Furthermore, at least for one
sample of seventh-grade students, the evidence also suggests that
providing workspace on the test sheet would be facilitating to students
demonstrating their level of skill in arithmetic computational tasks.
Thus, it would not seem unreasonable to believe that variations in 
testing procedures in relation to the format of answer sheets and the 
provision of working space on the examination booklet could affect 
the validity of the scores of students taking examinations in arithmetic 
that emphasize computational skills. 
THE CONCURRENT VALIDITY OF THE PRIMARY SELF-
CONCEPT SCALE FOR A SAMPLE OF THIRD-GRADE
CHILDREN


JOAN M. JENSEN AND JOAN J. MICHAEL 
California State University, Long Beach 


WILLIAM B. MICHAEL 
University of Southern California 


In addition to estimates of the reliability for the eight factor scales 
of the 24-item Primary Self-Concept Scale (PSCS) obtained from 
two administrations to a sample of 83 children in the third grade, 
concurrent validity coefficients of the eight scales were determined 
relative to the same eight factors on the Teacher Questionnaire (TQ) 
designed to reflect teachers’ perceptions of children's behaviors in 
these eight factor categories. Scores on four scales of the PSCS on its 
first administration and on three scales on its second administration 
yielded statistically significant validity (phi) coefficients with scores 
on corresponding factor categories of the TQ. 


For a sample of 62 children ranging in age from 8 years 9 months to
[illegible] years 2 months from the third grade of one school located in a
low-middle socio-economic area of southern California, the purpose of
this study was to determine the degree of concurrent validity between
scores on the eight factor scales of the 24-item Primary Self-Concept
Scale (PSCS) and scores on eight corresponding factor categories of the
Teacher Questionnaire (TQ) prepared by the first author.
Thus the relationship between self-reports on the
eight scales of the PSCS with teacher observations in the corresponding
categories in the TQ was sought as a means of obtaining some
evidence of the validity of the PSCS.
Description of the PSCS and TQ


Designed to provide an economic procedure for evaluation of 
several characteristics of self-concept relevant to school success, the 
PSCS was specifically constructed for use with children of Spanish or 
Mexican families in the Southwest, although Muller and Leonetti 
(1972) reported that the test was appropriate for use with children 
from the Anglo culture. On the basis of a factor analysis of a 
preliminary form, Muller and Leonetti redesigned the instrument by 
retaining items with high factor loadings and by adding 10 new items. 
Thus they had a 24-item test intended to reflect eight factors (although 
Item 23 dealing with two shades of skin color was omitted from 
Scoring in this investigation). In pictorial form each item depicts at 
least one child in a positive role and at least one child in a negative role 
with the exception of Item 23. After being told a simple descriptive 
story about each of the illustrations, a child is instructed to draw a 
circle around the person that is most like himself. In Table 1, each of 
the eight intended factors, a brief description of the factor, the 
numerical designations of the items corresponding to the factor, and a 
brief statement of the situation portrayed by the item are presented. 

The TQ contains 12 pairs of opposite-meaning words or phrases
that were selected from words or phrases employed in the description
of the eight factors of the PSCS set forth in Table 1. These contrasting
pairs of words or phrases were subsumed under a factor category
heading and were placed side by side in a left-right direction. Above
each of these two words or expressions, a line about 1.25 inches long
was placed so that a teacher could insert a check mark in the blank
formed by the line above the word or expression which was perceived
to represent behavior more like than unlike that of the child. The eight
[...] 1); (2) Relationship with peers II: ostracized versus accepted (Item 2); (3)
[...] teacher rejection (Item 8) and
parental acceptance versus parental rejection (Item 9); (7) Emotional self:
happy versus sad (Item 10) and angry versus not angry (Item 11); and (8) Tasks
undertaken (success level): successful versus unsuccessful (Item 12).

In the PSCS each item was scored one point for a socially acceptable
answer and zero points for a socially undesirable response as
TABLE 1
Primary Self-Concept Factors and Associated Item Descriptions

Factor Number and Description, with Item Number and Description

1. Peer aggressiveness or cooperation
   Child's view of himself in sharing and cooperating with peers.
   Items: 11. Loving dog vs. hitting dog
          20. Sharing candy vs. fighting
          22. Fighting vs. sharing toy

2. Peer ostracism or acceptance
   Child's view of his acceptance by his fellow students.
   Items: 16. In peer group vs. out of peer group
          18. Playing with others vs. playing alone
          24. Playing together vs. playing alone

3. Intellectual self-image
   Child's view of himself as a student, and his like or dislike of school.
   Items: 2. Reading vs. bothering child reading
          4. Looking out window vs. reading
          10. Doing school work well vs. not doing well
          19. Reading vs. playing in school room

4. Helpfulness
   Assesses child's role as helper or helpee as seen by himself.
   Items: 1. Helping vs. being helped
          8. Climbing vs. helping child climb
          9. Riding in wagon vs. pushing wagon

5. Physiological self
   Child's view of his physical self: large or small, strong or weak,
   dark-skinned or light-skinned.
   Items: 13. Small child vs. large child
          15. Small child playing vs. large child playing
          21. Strong vs. weak
          23.* Dark child vs. light child

6. Adult acceptance or rejection
   Child's view of parents and teachers as accepting or rejecting him.
   Items: 3. Helping mother vs. running from mother
          5. Spanked by mother vs. loved by mother
          14. Liked by teacher vs. scolded by teacher

7. Emotional self
   Laughing or crying, happy or sad, angry or not angry.
   Items: 7. Crying vs. laughing
          12. Sad vs. happy

8. Success or non-success
   Child's view of himself as to success at task-oriented pursuits.
   Items: 6. Building house vs. not able to build house
          17. Fixing puzzle vs. not able to fix puzzle

* Unscored item excluded in this research investigation.


predetermined by the test authors. For each factor scale the score was
simply the number of points earned, which could vary from zero to
four depending on the number of items. For a given factor Muller and
Leonetti considered a score of two or more to be socially desirable. A
similar procedure was followed for the TQ.


Data Analyses 


Phi coefficients were calculated between scores on corresponding
factors of the PSCS and TQ measures after the narrow distribution of
scores was dichotomized as close to the median value as possible on
each variable. Phi coefficients were evaluated for significance through
use of the chi-square statistic.

Since the PSCS was given a second time two months after the first
administration and that of the TQ, it was possible to obtain two sets of
validity coefficients for the TQ as well as test-retest reliability estimates
of the eight factor scales of the PSCS. Although on the first
administration of the PSCS 101 children (54 boys and 47 girls) in the
classes of four third-grade teachers participated, only 83 were present
for the retest and thus available for providing data necessary to
obtaining test-retest reliability estimates of the factor scores. However,
since only three of the four teachers evaluated children on the TQ who
had been present for the initial test and the retest of the PSCS, the
sample size decreased to only 62 for the validity coefficients.
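
As an illustration of the procedure just described, the following Python sketch computes a phi coefficient from a 2 x 2 table of dichotomized scores and evaluates it through the chi-square statistic, using the identity chi-square = N x phi squared for a fourfold table. The cell counts are invented for the example and are not data from this study; the function names are likewise assumptions, and 3.841 is the standard tabled critical value of chi-square for one degree of freedom at the .05 level.

```python
from math import sqrt

CHI2_CRITICAL_05_DF1 = 3.841  # tabled critical value: chi-square, df = 1, alpha = .05

def phi_coefficient(a, b, c, d):
    """Phi for a 2 x 2 table with cells
           a  b
           c  d
    where rows are the two levels of one dichotomized variable and columns
    the two levels of the other. All four marginal totals must be nonzero.
    """
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

def phi_is_significant(phi, n, critical=CHI2_CRITICAL_05_DF1):
    """Evaluate phi through chi-square = n * phi**2 with one degree of freedom."""
    return n * phi ** 2 > critical

# Hypothetical counts for 62 children split near the median on both measures:
# rows = high/low on a PSCS factor, columns = high/low on the matching TQ category.
a, b, c, d = 20, 10, 10, 22
n = a + b + c + d

phi = phi_coefficient(a, b, c, d)
chi_square = n * phi ** 2
print(f"phi = {phi:.3f}, chi-square = {chi_square:.2f}, "
      f"significant at .05: {phi_is_significant(phi, n)}")
```

Because the test depends on N as well as on phi, a coefficient of a given size is more readily significant in the reliability sample of 83 children than in the validity sample of 62.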


Findings 


In Table 2 the test-retest (phi) coefficients for the eight factor scales
of the PSCS as well as the two sets of validity (phi) coefficients are
cited along with levels of significance. The results may be summarized
as follows:

1. Although the first six of the PSCS factor scales showed
statistically significant reliability estimates varying from .25 to
.69, only four of the scales (1, 2, 5, and 6) of the initial administration
yielded statistically significant validity coefficients ranging from
.24 to .57, and only three of the scales (2, 5, and 6) of the second
administration furnished statistically significant validity
coefficients within the span of .29 to .33.

2. As would be anticipated, PSCS factor scales showing
nonsignificant or relatively low reliability estimates failed to
correlate significantly with the TQ criterion measure.


Discussion 


Although several of the PSCS scales yielded statistically significant
initial test-retest [...]




TABLE 2
Reliability Estimates (Initial Test-Retest Phi Correlations) of Each of the Eight Primary
Self-Concept Scale (PSCS) Factors and Validity Coefficients (Phi Correlations) of Each of the
Eight PSCS Scores with Its Corresponding Factor Category of the Teacher Questionnaire (TQ)


1016 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


and dependent upon the reliabilities of the PSCS factor scores as well 
as upon the reliabilities of the TQ factor category scores, which could
not be estimated. 

Little is also known of the difficulties that the children might have 
experienced not only in understanding the test tasks but also in 
selecting their choices in the pictorial format of the items of the PSCS. 
Moreover, the existence of possible response sets in the children such 
as acquiescence or social desirability, particularly on the retest, could 
have distorted the outcomes. Furthermore, the lack of teacher 
information about child behaviors sought in certain score categories 
or the lack of clarity or uniformity of meaning intended for a given 
category, which might have occurred from one teacher to another on 
both the PSCS and TQ, could be expected to attenuate the coefficients 
of correlation.

Although the PSCS showed some initial promise, caution should be 
exercised in its use. Additional research in developmental efforts in 
refining the items, in possibly adding one or two more items to each 
scale, and in correlating the scores on revised scales with behaviorally 
oriented criterion measures might be anticipated to improve both the
reliability and validity of the PSCS factor scales. 
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THE RELATIONSHIP BETWEEN SELF-ESTEEM AND 
ANXIETY IN GRADES FOUR THROUGH EIGHT 


MARGARET A. MANY 
Western Illinois University 


WESLEY A. MANY 
Northern Illinois University 


This study examined the relationships between a measure of
self-esteem and each of two measures of anxiety, one of general anxiety
and one of test anxiety, in a sample of 4,367 pupils in grades four
through eight. Coopersmith's Self-Esteem Inventory (SEI) was used to
assess self-esteem. Sarason's General Anxiety Scale for Children (GASC)
and Test Anxiety Scale for Children (TASC) were employed to measure
anxiety. There were statistically significant negative correlations
between the measure of self-esteem and each of the measures of general
anxiety and test anxiety when scores were analyzed by total group,
grade level, and sex. Although these correlations tended to be low to
moderate (-.24 to -.42), they were consistent in suggesting a negative
relationship between a measurable construct of self-esteem and each of
the corresponding constructs of general and test anxiety. The
implications tend to support the possibility of reducing anxiety in
elementary and junior high school age pupils by enhancing the way
in which they see themselves.


RESEARCH findings generally indicate that persons with high self-esteem
are happier and more effective in meeting societal demands
than are persons with low self-esteem. Findings further point to the
undesirable consequences that can accrue as a result of extreme anxiety
within the individual. Coopersmith (1967) has suggested that this
anxiety may occur when the individual expects to be or actually is
rejected by himself or by others.

In a study involving fourth, fifth, and sixth grade pupils, Lipsitt
(1958) found that children with poor self-concept were significantly
more anxious than were children with good self-concepts. From his 
study with college females Mitchell (1959) concluded that the better 
the self-concept, the less anxiety evidenced. Imbler (1968) obtained a 
significant negative correlation between anxiety and positive self-con- 
cept. In a related study involving a comparison between high and low 
self-esteem subjects Lampl (1968) observed that the low self-esteem 
subjects were higher in anxiety than were the high self-esteem subjects. 
Similarly, studies by Van Buskirk (1961), Wittrock and Husek (1962), 
Coopersmith (1967), and Ausubel and Robinson (1969) have provided 
evidence that a negative relationship between level of anxiety and 
favorableness of self-concept or self-esteem appears to exist. 

The purpose of this study was to examine the relationship between a 
measure of self-esteem and (a) a measure of general anxiety and (b) 
one of test anxiety in a large population and in subpopulations of 
elementary school children in grades four through eight. The measures 
employed were Coopersmith's Self-Esteem Inventory (SEI) and Sarason,
Lighthall, Davidson, Waite, and Ruebush's (1960) General Anxiety
Scale for Children (GASC) and Test Anxiety Scale for Children
(TASC). To the writers' knowledge a study of the relationship between
Coopersmith's Self-Esteem Inventory and Sarason's anxiety scales had
not been previously conducted. This study was intended to provide
further data concerning the degree of relationship between a construct
of self-esteem and each of two constructs of anxiety in terms of
their being represented operationally by these particular instruments
that have gained increased acceptance by educators.


Subjects 


The subjects consisted of 4,367 pupils from public schools in grades
four through eight in East Aurora and in Wheaton, Illinois. This
sample included all students from these grade levels in the two
communities, except those for whom both test scores were unavailable and
those students who had an anxiety lie score of five or lower. The
diversity of these two communities, when combined, affords a broad
range of socio-economic status as well as representation from different
ethnic and racial minority groups.


Procedure 


The SEI, GASC, and TASC were administered to all children in
grades four through eight of the participating schools. The scales were
distributed to each teacher responsible for a grade or class. Both scales
were administered according to standard directions accompanying the
scales. The two anxiety scales, GASC and TASC, were given to the
pupils at one testing period and the SEI at another. The scales were
read to all groups. The testing was accomplished during the second
week of May. It was assumed that by this time in the school year
anxiety a student might feel as a result of being in a new school,
new class, or new program would be greatly diminished.


TABLE 1
Correlation between Scores on SEI Scale and Scores on GASC Scale
and between Scores on SEI Scale and Those on TASC Scale


Paired Scales      N        r
SEI and GASC       4,367    -.280*
SEI and TASC       4,367    -.381*


* Significant beyond the .001 level.

After being assembled in an appropriate form the data were checked 
for accuracy by the research directors. Pearson product moment 
coefficients of correlation were calculated between sets of scores and 
the data were analyzed by total group score, by grade level, and by sex. 
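
The analysis described above can be sketched as follows. This is only an illustration of computing a Pearson product moment coefficient for the total group and then within grade and sex subgroups; the function names, field names, and the six pupil records are hypothetical and are not the study's data.

```python
from collections import defaultdict
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def r_by_group(records, key, x_field, y_field):
    """Compute r separately within each level of `key` (e.g. grade or sex)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append((rec[x_field], rec[y_field]))
    return {level: pearson_r([p[0] for p in pairs], [p[1] for p in pairs])
            for level, pairs in groups.items()}


# Hypothetical records, invented for illustration only (not the study's data).
pupils = [
    {"grade": 4, "sex": "M", "sei": 30, "gasc": 20},
    {"grade": 4, "sex": "F", "sei": 35, "gasc": 15},
    {"grade": 4, "sex": "M", "sei": 40, "gasc": 16},
    {"grade": 5, "sex": "F", "sei": 28, "gasc": 22},
    {"grade": 5, "sex": "M", "sei": 33, "gasc": 18},
    {"grade": 5, "sex": "F", "sei": 42, "gasc": 12},
]

total_r = pearson_r([p["sei"] for p in pupils], [p["gasc"] for p in pupils])
grade_r = r_by_group(pupils, "grade", "sei", "gasc")
sex_r = r_by_group(pupils, "sex", "sei", "gasc")
```

In this invented sample every coefficient is negative, which mirrors the direction of the reported relationships but says nothing about their size.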


Results 


As is evident in Table 1, significant negative relationships were 
found between the measure of self-esteem and each of those of general 
anxiety and test anxiety for the total population of children. 
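
One way to see why coefficients of this size are significant in so large a sample is the usual t test for a correlation. The sketch below assumes that test (the source does not say which test the authors used) and uses the total-group values from Table 1; the function name and the normal-approximation critical value are choices made for the illustration.

```python
from math import sqrt


def t_for_r(r, n):
    """t statistic for testing H0: rho = 0, with n - 2 degrees of freedom."""
    return r * sqrt((n - 2) / (1 - r ** 2))


# Values reported in Table 1 for the total group.
N = 4367
t_gasc = t_for_r(-0.280, N)   # SEI with GASC
t_tasc = t_for_r(-0.381, N)   # SEI with TASC

# With several thousand degrees of freedom the t distribution is close to the
# standard normal, whose two-tailed .001 critical value is about 3.29.
CRITICAL_001 = 3.29
significant = [abs(t) > CRITICAL_001 for t in (t_gasc, t_tasc)]
```

Statistical significance here reflects the sample size as much as the strength of the relationship: a coefficient of -.280 corresponds to roughly 8 percent shared variance.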

When analyzed by grade level, similar statistically significant negative
relationships were found. As is apparent in Table 2, the highest
relationships were found among the sixth grade pupils. 


TABLE 2
Correlation between SEI Scores and (a) GASC Scores and
(b) TASC Scores within Each of Five Grade Levels


Paired Scales      Grade Level    N      r
SEI and GASC       4              824    -.243
                   5              816    -.287
                   6              960    -.318
                   7              854    -.299
                   8              913    -.259
SEI and TASC       4              824    -.388
                   5              816    -.399
                   6              960    -.424
                   7              854    -.366
                   8              913    -.319


* All correlation coefficients significant beyond the .001 level.


TABLE 3
Correlation between SEI Scores and (a) GASC Scores and
(b) TASC Scores by Sex


Paired Scales      Sex       N        r
SEI and GASC       Male      1,997    -.289
                   Female    2,320    -.272
SEI and TASC       Male      1,997    -.377
                   Female    2,320    -.381


* All correlation coefficients significant beyond the .001 level.


Conclusions and Discussion 


There were statistically significant negative correlations between a 
measure of self-esteem and each of the measures of general anxiety and 
test anxiety when scores were analyzed by total group, by grade level, 
and by sex. Although these correlations tended to be low to moderate
(-.24 to -.42), they were consistent in suggesting a negative
relationship between a measurable construct of self-esteem and each of the
[corresponding constructs of general and test anxiety ...]
selected generally supported the outcomes of other similar research.
Although a correlational study of this nature does not deal with cause
and effect relationships, the implications tend to support the
possibility of reducing anxiety in elementary and junior high school age
children by enhancing the way in which these children see themselves.
It would appear reasonable to suggest that efforts to provide
opportunities for successful achievement should be undertaken in an
environment that reinforces the adequacy and worthiness of the individual


Eighty-four second grade children were administered the
Lorge-Thorndike Test of Cognitive Abilities (LTTCA), the Test of Auditory
Perception (TAP) and, five months later, the Metropolitan
Achievement Test (MAT). Pearson product-moment [...]
statistically significant became non-significant except those between
TAP subtests and the MAT Word Analysis subtest. It was concluded
that performance in only those aspects of an academic program
most directly related to auditory perception can be predicted
using the TAP.


[The] Test of Auditory Perception (TAP) (Sabatino and Foster, [...])
is an experimental device designed to facilitate prescription of
[educat]ional programs for learning disordered children. It has been
well established that auditory perception is related to academic
achievement, usually measured by reading ability (Benger, 1968; Rob- 
inson and Hanson, 1968; Wepman, 1960). It has, therefore, been 
recommended that information relating to auditory perception be 
used in formulating educational programs (Johnson and Myklebust, 
1967; Wepman, 1960). Empirical evidence, however, does not support 
the efficacy of this practice (Waugh, 1973; Ysseldyke, 1973). Ysseldyke 
(1973) suggested that at least part of the difficulty in prescribing 
academic programs on the basis of a child's auditory or visual per- 
ceptual abilities is the “lack of reliable and valid devices which may be 
used to identify behavioral (ability) strengths and weaknesses in chil- 
dren” (p. 26). 

The purpose of this study was to establish the predictive validity of
the TAP relative to academic achievement, and to verify that auditory
perception as measured by the TAP is a unique ability construct, i.e.,
that it is not primarily a measure of general intelligence.


Procedure 
Subjects 


The pupils in an entire second grade (n = 84) attending a rural 
elementary school participated in this study.


Experimental Instrument 


The Test of Auditory Perception (TAP) consists of three subtests:
(1) Phoneme Discrimination, requiring differentiation among three
similar sounding nonsense syllables; (2) Word Recognition, involving
identification of a legitimate word among a set of similar sounding
nonsense words; and (3) Sequencing, necessitating the matching of a series
of nonsense syllables to one of three multisyllabic nonsense words.
KR-20 reliabilities on the subtests range from r = .740 to r = .896,
depending on subtest and age level.
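
For readers unfamiliar with the KR-20 coefficient cited above, a minimal sketch of its computation follows. It assumes dichotomously scored (0/1) items and the population variance of total scores; the function name and the small response matrix are hypothetical and are not TAP data.

```python
def kr20(responses):
    """KR-20 for a persons-by-items matrix of 0/1 scores.

    Uses the population variance (divide by n) of total scores; some
    textbooks divide by n - 1 instead, which changes the value slightly.
    """
    n = len(responses)
    k = len(responses[0])
    # Proportion passing each item, and the sum of p * q across items.
    p = [sum(person[i] for person in responses) / n for i in range(k)]
    sum_pq = sum(pi * (1 - pi) for pi in p)
    # Variance of the total (number-correct) scores.
    totals = [sum(person) for person in responses]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    return (k / (k - 1)) * (1 - sum_pq / var_total)


# Hypothetical 4 pupils x 3 items, invented for illustration.
demo = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
demo_kr20 = kr20(demo)
```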


Method 


[The subjects] were administered the Lorge-Thorndike Test of Cognitive
Abilities (LTTCA) (Lorge, Thorndike, and Hagen, 1964) and the
TAP. Five months later the battery of Metropolitan Achievement Test
(MAT) (Durost, Bixler, Wrightstone, Prescott and Balow, 1971) was
administered.


Results


Each TAP and MAT subtest was treated separately for purposes of
analysis. To investigate the simple relationships among the individual
subtests, Pearson product moment correlations (r) were computed.
These results appear in Table 1. All TAP subtests correlated significantly
(p < .05) with all MAT subtests except the TAP Discrimination
subtest versus the MAT Reading subtest (r = .19) and the TAP Sequencing
subtest versus the MAT Word Knowledge subtest (r = .17).

To investigate the possibility that the relationship between the TAP
subtests and the MAT subtest was due to overlap with a general 
intelligence factor, partial correlation coefficients (Partial r) were com- 
puted between each TAP subtest and each MAT subtest holding 
LTTCA scores constant. These results also appear in Table 1. 
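
The partialling step can be illustrated with the standard first-order partial correlation formula. The sketch below assumes that formula; the function name and the three input correlations are hypothetical values chosen for the example, not figures from Table 1.

```python
from math import sqrt


def partial_r(r_xy, r_xz, r_yz):
    """First-order partial correlation of x and y with z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical zero-order correlations, invented for illustration:
# x = a TAP subtest, y = a MAT subtest, z = LTTCA score.
shrunk = partial_r(r_xy=0.30, r_xz=0.50, r_yz=0.55)

# When the control variable is unrelated to both measures, nothing is removed.
unchanged = partial_r(r_xy=0.30, r_xz=0.0, r_yz=0.0)
```

With these invented inputs a zero-order coefficient of .30 falls to nearly zero once the third variable is held constant, which is the kind of drop the partial coefficients are meant to detect.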

All correlations between the TAP subtests and MAT Word Knowl- 
edge, Reading and Mathematics were nonsignificant after LTTCA 
scores were partialled out. Those correlations between the TAP sub- 
tests and the MAT Word Analysis subtest, however, remained signifi- 
cant (p < .01). 

To investigate the possibility that a weighted combination of TAP 
subtests might predict some MAT subtests more accurately than 
would any individual TAP subtest, a stepwise regression was com- 
puted using TAP subtests as predictors and each of the MAT subtests 
as a dependent variable. The results appear in Table 2. The accuracy
with which the individual TAP subtests predicted performance on the
MAT subtests was increased by using a weighted combination of TAP
subtests. This increment in validity was especially notable in the case
of the MAT Word Analysis subtest. 
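
Why a weighted combination cannot predict worse than its best single component can be seen in the two-predictor case. The sketch below uses the textbook formula for the multiple correlation with two predictors; it is not the stepwise procedure the authors ran, and the function name and input correlations are hypothetical.

```python
from math import sqrt


def multiple_r_two_predictors(r_y1, r_y2, r_12):
    """Multiple correlation of y with the best-weighted sum of two predictors."""
    r_squared = (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)
    return sqrt(r_squared)


# Hypothetical correlations, invented for illustration:
# two TAP subtests correlating .30 and .40 with a MAT subtest, and
# uncorrelated with each other.
combined = multiple_r_two_predictors(r_y1=0.30, r_y2=0.40, r_12=0.0)
best_single = max(0.30, 0.40)
```

The gain from combining predictors shrinks as the predictors become more highly correlated with each other, since they then carry overlapping information.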


TABLE 1
Zero-Order Correlation Coefficients between TAP Subtests and MAT Subtests and
Corresponding Partial Correlation Coefficients Holding LTTCA Scores Constant

[Columns: r and partial r for each TAP subtest (Discrimination, Recognition,
Sequencing); rows: MAT subtests (Word Knowledge, Word Analysis, Reading,
Mathematics). Table entries are not recoverable.]

*p < .05.
**p < .01.


TABLE 2
Regression Analysis Using TAP Subtests as Predictors,
and MAT Subtests as Dependent Variables

[Columns: dependent variable, independent variable(s), regression
coefficient, correlation coefficient, multiple correlation coefficient,
proportion of explained variance. Table entries are not recoverable.]


Discussion


The MAT Word Knowledge subtest had low correlation coefficients
with all of the TAP subtests. The corresponding coefficients of the TAP
subtests were also low relative to the MAT Reading subtest. Even
though the Mathematics subtest had modest correlations with the
TAP subtests, these correlations dropped to nearly zero after partialling
out variance associated with the LTTCA. There are several possible
explanations for these results. The low correlations between the
TAP subtests and the MAT Word Knowledge and Reading subtests
might indicate that these tasks, as measured, do not relate significantly
to auditory perception, as tested. The marked decrease in correlation
coefficients between the TAP subtests and the three MAT subtests
previously mentioned, which was due to partialling out variance
associated with intelligence test scores, might indicate either that these
tasks require a high degree of cognitive mediation which may mask the
importance of auditory perceptual functioning or that general
in[telligence] tests measure a significant component of auditory
per[ception].

[...] all TAP subtests did produce a zero-order correlation
coefficient above .30 (p < .01) with the MAT Word Analysis subtest.
Further, the combined TAP subtests produced a multiple correlation
coefficient of .45 (p < .001) and the TAP Word Analysis correlation
[...] reduced by partialling out variance in common with
[the] LTTCA [...]. The Word Analysis subtest appears to be the MAT
[subtest] most directly related to auditory perception, which may
account for the [...].

[...] the preceding data analysis that only those
[...] achievement which can be directly associated with
auditory functions [at] a very basic level, can be accurately predicted
from the TAP. Therefore the TAP may prove useful in
deter[mining which children] will probably function adequately in an
aca[dem]ic program stressing auditory skills, e.g., phonic reading, and
[which] children may have undue difficulty if forced through such a
[...].

Perhaps the most potentially important conclusion that can be inferred
from this study is that it supports the possibility that performance
[in sp]ecific skill areas in reading can be predicted. Global tests, such as
[intel]ligence measures, can accurately predict global results such as
[read]ing ability; but results obtained from global measures are of
lim[ited] diagnostic use. If utilitarian subskills can be determined, e.g.,
[phon]ic word analysis, and if researchers can devise measures which
reliably and validly predict a given child's ability in these subskill
[area]s, it may prove a useful teaching strategy to use this information
[...] to strengthen the area of weakness or to circumvent a wall of
[...]tion.


THE VALIDITY OF THE SRA ACHIEVEMENT SERIES,
MULTILEVEL EDITION: READING, LANGUAGE ARTS, 
AND ARITHMETIC SUBTESTS FOR MINORITY AND 
NON-MINORITY GROUP FOURTH GRADE PUPILS 


EDWARD B. TOKAR AND FREDERICK STOFFLET
Norfolk Public Schools 


Significant positive correlations, with one exception, between rat- 
ings of fourth grade pupils by teachers on a three point scale and 
scores on the Reading, Language Arts, and Arithmetic subtests of 
the SRA Achievement Series, Multilevel Edition, Blue Form E sug- 
gested that these subtests would be valid measures of group academic 
achievement by both minority and non-minority children. A
non-significant correlation between SRA reading and teacher's rating of
minority group pupil reading achievement suggested a need for fur- 
ther investigation. Correlations between SRA subtest scores and 
teacher’s ratings ranged from .20 to .57 for minority group pupils 
and from .46 to .55 for non-minority group pupils. 


THE purpose of this study was to investigate the validity of the
Reading, Language Arts, and Arithmetic subtests of the SRA Achievement
Series, Multilevel Edition, Blue Form E. Of particular interest
was the SRA's differential validity for minority and non-minority
group fourth grade pupils.

To facilitate the analysis, a pupil's raw score for the Reading subtest
was designated as his SRA reading score (SRAR). Similarly, a pupil's
raw scores for the Language Arts and Arithmetic subtests were designated
as his SRA Language Arts (SRALA) and SRA Arithmetic
(SRAA) scores, respectively.


Subjects 


A total of 92 fourth grade pupils, within five elementary schools of 
the Norfolk Public Schools, participated in the investigation. The
minority group was composed of 55 black pupils whereas the
non-minority group was composed of 37 nonblack pupils.


Criteria 


The yearly average of the teacher's ratings of each pupil's performance
served as the criteria. Teachers assigned ratings corresponding
to a judgment of “progressing slowly,” “progressing,” or
“progressing rapidly.” These levels were assigned the quantities 1, 2,
and 3, respectively. Ratings in the factors of understanding, reading
smoothly, attacking new words, building vocabulary, and reading
independently were summed and designated as the Teacher's Rating of
Reading (TRR). Similarly, ratings in the categories of spelling needed
words, spelling assigned words, punctuation, capitalization, and
expressing ideas were summed and designated as the Teacher's Rating of
Language Arts (TRLA). Furthermore, the sum of teacher's ratings of
sets, place value, division, measurement, geometry, and fractions was
designated as the criterion Teacher's Rating of Arithmetic (TRA).
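
The scoring rule for the criterion measures can be sketched as below, using the reading criterion (TRR) as the example. The numeric values and factor names come from the description above; the function name, the dictionary layout, and the single pupil's ratings are hypothetical.

```python
# Numeric values the source assigns to the three rating levels.
RATING_VALUES = {
    "progressing slowly": 1,
    "progressing": 2,
    "progressing rapidly": 3,
}

# Factors the source lists under the reading criterion (TRR).
READING_FACTORS = [
    "understanding",
    "reading smoothly",
    "attacking new words",
    "building vocabulary",
    "reading independently",
]


def summed_rating(ratings, factors):
    """Sum the numeric values of one pupil's ratings over the given factors."""
    return sum(RATING_VALUES[ratings[factor]] for factor in factors)


# Hypothetical ratings for one pupil, invented for illustration.
pupil = {
    "understanding": "progressing",
    "reading smoothly": "progressing rapidly",
    "attacking new words": "progressing slowly",
    "building vocabulary": "progressing",
    "reading independently": "progressing",
}
trr = summed_rating(pupil, READING_FACTORS)
```

With five factors each scored 1 to 3, a summed reading rating under this rule can range from 5 to 15.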


Results and Discussion 


The SRA tests were administered in October of 1973. The means, 
standard deviations and intercorrelations of the SRAR, TRR, 
SRALA, TRLA, SRAA, and TRA for non-minority and minority 
pupils are shown in Table 1.

All SRA subtest scores, with one exception, were significantly related
to teacher's ratings for both non-minority and minority pupils
beyond the .05 level of significance. The results would suggest that the
Language Arts and Arithmetic subtests of the SRA Achievement
Series, Multilevel Edition, Blue Form E, are valid measures of fourth
grade minority and non-minority group pupil achievement. However,
the validity of the Reading subtest, for minority group pupils, should
be re-examined.


A PROFILE OF THE VALIDITY OF POSTGRADUATE
PSYCHOLOGY EXAMINATIONS IN PAKISTAN IN TERMS 
OF THEIR CONGRUENCE WITH EDUCATIONAL 
OBJECTIVES 


Z. A. ANSARI 


University of Peshawar 
Peshawar, Pakistan 


Within the framework provided by the six process-oriented cogni- 
tive categories in the Taxonomy of Educational Objectives by Bloom 
and his co-workers the essay-type items of postgraduate achievement 
examinations in psychology for students at the University of Pesh- 
awar were classified, and the relative frequencies of the items were 
compared with those for psychology examinations administered at a 
British university. Although the categories of Knowledge and Com- 
prehension were amply represented, the dimensions of Analysis and 
Synthesis would appear to have been very much underrepresented at 
the University of Peshawar. Thus possible sources of invalidity in the 
achievement examinations were identified if the behaviors of the 
examinees both as students and as future psychologists demand 
abilities of Analysis and Synthesis in research endeavors, in report 
writing, and in oral communication as required in teaching or public 
service.


THE publication of the Taxonomy of Educational Objectives (Bloom,
Engelhart, Furst, Hill, and Krathwohl, 1956) has provided a framework
for evaluation of achievement tests in terms of educational objectives.
Standardized achievement tests as well as teacher-made tests at
all educational levels have been evaluated through using the Taxonomy
or its variants as a model (Ansari, 1971; Bloom, 1959;
McGuire, 1963; Yorkshire Regional Examinations Board, 1968).

The purpose of the present investigation was to evaluate the validity
of achievement tests being given to MA/MSc Psychology students at
the University of Peshawar. Since these examination papers are set by
senior teachers of psychology from other universities of Pakistan,
these tests can be taken as a fair sample of the type of examinations 
being given to the postgraduate students of psychology in Pakistan. 


Material 


The MA/MSc psychology examination in the University of Pesh- 
awar consists of six theory papers and of some laboratory work, 
evenly divided between two annual examinations. Only the theory 
examinations were included in this investigation. According to the 
custom of this country, each theory paper consists of ten or eleven 
essay-type questions; sometimes the last one is a short-notes question. 

Since 1965, when the University of Peshawar held the first MA/MSc
examinations in psychology, up to 1973, a total of 534 essay-type questions
were asked. Since this number is a very large one, only half of these 
items were used for the purpose of analysis. However, each of the three 
raters missed evaluating some items, with the result that the judgments 
by all the three raters were available for 172 items only. 


Raters 


The raters were three students of MA Final Psychology class, who 
had had Psychological Testing as their major subject. The main points 
of Bloom's Taxonomy were explained to the raters, and they were 
provided with a copy of the main categories of classification. The 
ratings were done independently. The measure of agreement used was 
the classification of an item in the same category by at least two raters. 
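
The agreement rule stated above (an item counts as classified when at least two of the three raters place it in the same category) can be sketched as follows. The function names and the three example items are hypothetical; only the rule itself comes from the text.

```python
from collections import Counter


def agreed_category(ratings):
    """Return the category chosen by at least two of three raters, else None."""
    category, votes = Counter(ratings).most_common(1)[0]
    return category if votes >= 2 else None


def agreement_level(ratings):
    """Label an item as 'total', 'partial', or 'none' agreement."""
    top_votes = Counter(ratings).most_common(1)[0][1]
    return {3: "total", 2: "partial"}.get(top_votes, "none")


# Hypothetical ratings of three items by three raters (illustration only).
items = [
    ("Knowledge", "Knowledge", "Knowledge"),
    ("Analysis", "Comprehension", "Analysis"),
    ("Knowledge", "Application", "Evaluation"),
]
classified = [agreed_category(r) for r in items]
levels = [agreement_level(r) for r in items]
```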


Findings and Discussion 


Out of 172 items for which the ratings by all the three raters were
available, 22 items were placed in the same category by all the three
raters (p = .005), whereas on 92 items there was agreement among two
out of three raters (p = .069). In all, total or partial agreement was
reached on 114 items (66%). The extent of agreement is similar to what
was found in an earlier investigation in which essay-type tests were
used (Ansari, 1971).

The profile of the validity of examinations in terms of educational
objectives is presented in Table 1. The results show that the main
emphasis of Peshawar University examinations has been on cognitive
objectives of the lower categories. Two of the lowest categories,
Knowledge and Comprehension, account for about two-thirds of the
examination questions. The four higher categories share the remaining
one-third of the questions.


TABLE 1
Percentages of Items in Postgraduate Examinations of Peshawar and Glasgow Universities
Assigned to Each of Six Categories of Educational Objectives


                      Peshawar    Glasgow
1. Knowledge          36          26
2. Comprehension      28          18
3. Application        14           9
4. Analysis            5          21
5. Synthesis           3           3
6. Evaluation         14          23

Totals               100         100
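
The "about two-thirds" figure can be checked directly from the percentages in Table 1. The short sketch below does so for both universities; the variable and function names are hypothetical, while the percentages are those reported in the table.

```python
# Percentages of items per category, as reported in Table 1.
PESHAWAR = {"Knowledge": 36, "Comprehension": 28, "Application": 14,
            "Analysis": 5, "Synthesis": 3, "Evaluation": 14}
GLASGOW = {"Knowledge": 26, "Comprehension": 18, "Application": 9,
           "Analysis": 21, "Synthesis": 3, "Evaluation": 23}

LOWER = ("Knowledge", "Comprehension")


def lower_share(profile):
    """Percentage of items falling in the two lowest categories."""
    return sum(profile[name] for name in LOWER)


peshawar_lower = lower_share(PESHAWAR)   # 36 + 28
glasgow_lower = lower_share(GLASGOW)     # 26 + 18
```

The two lowest categories account for 64 percent of the Peshawar items against 44 percent of the Glasgow items.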

In an earlier study the psychology examinations of a British university
(Glasgow) were analyzed in a similar manner (Ansari, 1971). The
two analyses show interesting similarities as well as differences. The
British university examinations also placed a heavy emphasis on lower
cognitive objectives, but not to the same extent as in Pakistan. There is
a greater emphasis in the British university than in Pakistan on questions
requiring Analysis and Evaluation. Surprisingly, in both the
universities questions of Synthesis are rather few. This observation is
particularly noteworthy because those who favor the use of essay-type
questions maintain that “. . . abilities to select, relate and organize, to
create essentially new patterns” can be appraised better by the essay-type
tests (Thorndike and Hagen, 1955). In spite of this potential, it
seems that the essay-type tests are not being used to measure these
abilities, particularly in Pakistani universities. Thus if the processes of
synthesis and analysis are important in the postgraduate curricula of
psychology and to the subsequent research and vocational endeavors
in psychology, their lack of representation in postgraduate achievement
examinations may constitute significant sources of examination
invalidity in need of correction or remediation.


The work of the evaluator is described in Evaluating Educational
Programs and Products as consisting of the performance of three differ- 
ent activities: establishing perspectives, planning the evaluation, and 
analyzing the data. Following an introductory chapter entitled Pro- 
logue, chapters are organized according to three major themes: “Roles 
and Contexts,” “Models and Strategies," and “Methods and Tech- 
niques for Evaluating Educational Programs and Products"; the text is 
concluded with a chapter entitled Epilogue. Introductory sections set 
the themes to the three major sections of the text. 

Chapters are further classified in a reader’s guide according to prob- 
able relevance of content for an individual’s position in an educational 
system. Using this table, the reader can begin by reading the content 
which is likely to be of primary importance and leave the reading of
other less related chapters to some later time.

Michael Scriven begins in the Prologue to enumerate some stand- 
ards for use in the evaluation of educational programs and products. 
A Product Evaluation Profile checklist is provided for use by individ- 
uals responsible for the evaluation of educational program and prod- 
uct development. 

Chapter 1 presents a statement by Jean W. Butman and Jerry L.
Fletcher on the use of theory to provide a basis for the educational R
& D process and how the developer/evaluator is constrained by practical
and political influences in his/her attempt to attain these idealized
states of operation.

The next four chapters are directed toward the delineation of specific
activities of and/or proposed standards for the product evaluation
process. Eva Baker considers the role of evaluation in the
instructional development process by relating the formative evaluation
needs of instructional product developers to data collection activities
which can provide the required information on needs. An example of
the development of an instructional product is also provided to indicate
the role of formative evaluation in the developmental process
and to remind the reader that the best developed plans may still result
in the data to be used for making a product revision coming in after
the revision has been accomplished. Chapter 3, written by Barbara J.
Brandes, describes the role of formative evaluation within the development
of an instructional product designed specifically to attain
affective as well as cognitive outcomes. Noting that different formative 
evaluation strategies are required to attain these types of outcomes, 
Brandes suggests that early in the instructional development process, 
the focus should be on attaining cognitive outcomes and that only 
when the cognitive part is under control should the developer turn to 
looking at affective goals. Alkin and Fink, in their chapter, assert that 
the individual who purchases instructional products is generally not 
given enough information on the appropriateness of the product for 
the instructional situation, or else the information is not summarized in an
easily digested form. Desirable requisite information items for educa- 
tional product purchasers are provided along with an example of how 
such information has been packaged for one instructional product. 
Even though this chapter was oriented to people “out-in-the field,” the
concerns stated coincide with those presented by Scriven in the Pro- 
logue. Similar in focus to the Alkin and Fink chapter is the next 
chapter by Louise L. Tyler and M. Frances Klein which is concerned 
with specifying and discussing recommendations also categorized as 
to whether they are to be considered desirable, very desirable, or
essential requirements in the curriculum or instructional product devel- 
opmental process. 

Part two in the text presents a number of different models (or views)
representing procedures used for performing program and product
evaluation. Wright and Hess begin in Chapter 6 by identifying the
dimensions of (1) stages of evaluation, (2) audience for the evaluation,
and (3) domains of evaluative criteria, so that activities to be performed
in the educational product evaluation process can be identified
and the criteria to be achieved at each evaluative stage specified.
The Bertram and Childers and Katz and Morgan chapters describe
systems models which indicate the flow of activities and information
generated through the evaluation process. While Bertram and Childers
[...] and procedures in an evaluation process, the
Katz and Morgan chapter is more general. Their flow model has
decision points which raise questions about the extent to which goals
are being attained and the need for someone to make decisions in
reference to originally stated goals of the project as mediated by the
[...] in which the program-product development operates.

The next two chapters describe quality control approaches for evaluating
educational R & D efforts by Jerry P. Walker (chapter 9) and
the [...] of process evaluation by Max Luft, Janice Lujan, and
[...] A. Bemis (chapter 10). Both of these presentations include
a [...] scale which has been used in such evaluation activities.

[...] six chapters which present the methodological
[...] formative and summative evaluation of educational
programs and products. Chapter 11, Formative Evaluation: Selecting
Techniques and Procedures, was written by James Sanders and Donald
Cunningham. This chapter describes the so-called “procedures and
materials” of formative evaluation categorized according to appropriateness
for each of four different stages of formative evaluation: pre-development,
evaluation of objectives, interim evaluation, and product
evaluation.

The remaining five chapters deal more specifically with the analysis
aspect of evaluation. Borich and Drezek in chapter 12 use the correlational
causal relationship methodology developed by Blalock to ascertain
the validity of hypotheses relating concomitant variables and the
instructional transaction resulting from the use of a new educational 
product. A computer program is included as an appendix to assist 
readers in implementing such an approach to process evaluation. Both
Eichelberger and Edwards, in their chapters (13 and 14), agree with 
the thesis of Katz and Morgan that the evaluator needs to be aware of 
the environment in which the evaluation is being performed. Such 
information can be used to identify the level of analyses to be used in
the formative and summative evaluations of educational programs
and products. Both authors are very strong in the use of advanced
correlational-based analysis procedures for evaluation. Poynor, in
chapter 17, considers specifically the question of what is the appropri- 
ate unit of analysis of evaluation studies. He concludes that hier- 
archical models with classes serving as nested factors represent the 
analysis procedure of choice since this procedure allows one to test
the assumption of equivalency of classes. When this assumption is
rejected, the class mean becomes the unit of analysis. If the assumption
is not rejected, then the individual pupil can be used as the unit of
analysis.

[...] statistical analysis procedures for evaluation studies. They present four
different types of situations and discuss the use of alternative analysis
models. Their conclusion is that no one design and/or analysis strategy
will work in all settings; however, the guidelines provided in the
chapter will be useful to all individuals involved in the analysis of data
collected to document effectiveness of a particular educational product
or program.

The Epilogue written by P. Kenneth Kosmoski attempts to describe 
the future of instructional product development. Specifically, he feels 
that educational product developers must begin to provide alternative 
instructional modes to the printed page and that future prospective 
buyers of such goods will require hard evidence of the effectiveness of 
such projects to bring about learning.

Since instructional product and program evaluation is still a new
and developing field of study, it is not possible to evaluate the contributions
in this book for accuracy of content as one could do in the
evaluation of a text in such a field as Statistics. In contrast, it does
seem possible to evaluate the extent to which this book does achieve its
goals.
The editor stated in the preface that the “book is not a textbook or 
collection of articles, but an especially prepared guide and handbook
for planners, developers, and evaluators of educational programs and
products” (p. vii). In light of this statement, the reviewer was struck by
the lack of continuity. Since the editor stated a long (two years) and
intensive R & D cycle of review and revision was performed in the
production of the text, one would expect a smooth-flowing, balanced 
presentation of elements of formative evaluation. Unfortunately, this 
is not the case. The chapters appear to have been written without any 
consideration or, more likely, without any knowledge of what the 
other authors were doing. None of the contributed chapters contain a 
reference to any other chapter in the text. This was surprising in view 
of the unavoidable overlap in content treated by the several chapters.
For example, the Prologue and Epilogue chapters present some sugges- 
tions as to information needs for prospective users of instructional 
products, the topic covered by Alkin and Fink and yet no reference is 
made to any of the other's ideas on the subject. Another example was 
the consideration of which unit of analysis is appropriate to use in 
evaluation studies. This topic is considered briefly by Edwards while 
Poynor devotes his whole chapter to this question and neither cites the 
other. This reviewer feels an individual who takes on the responsibility 
of coordinating the gathering of information from several authors 
should also use the review process to inform an individual author of
what the other contributors have said that is relevant to their topic. 
This reviewer also feels a handbook should essentially cover the 
field. Thus, a reasonable approach would be to identify a number of 
different aspects of the instructional product-program evaluation proc- 
ess and then to have specific chapters written to cover each of these 
areas. Thus, the overlap of topics treated and wide range of specificity
of content covered in the chapters was surprising in view of the claims
of careful planning and selection of topics. The extreme range of
specificity is denoted at one end by one chapter in the methodology
section which only covers the topics of appropriateness of units of 
analysis and at the other extreme by a chapter which exhaustively 
covers measuring instruments and associated operational procedures
useful in formative evaluation. 
[...] this reviewer feels the book will not serve as a basic
[...] performing formative evaluations of [...].
[...] by being selective, readers can benefit
from the viewpoints [...] practical suggestions presented in the text. As
one way of being selective and optimizing the investment of time, this
reviewer suggests that the prospective reader cover the major part
introductions and consult the readers' guide before turning to specific
chapters.


JOHN L. WASIK
North Carolina State University 
H. J. Eysenck. The Inequality of Man. San Diego, California: EdITS
Publishers, 1975. Pp. 288. $8.95. 


The thesis of this book written for the educated layman is that
variability in human traits is primarily genetically determined. Eysenck
is concerned with the origins of differences among people, regardless
of race or ethnic background. (The terms “race” and “ethnic”
do not even appear in the index.) The popular slant of the presentation
is indicated by the size of the bibliography: only 79 references are
listed.

As usual, Eysenck's arguments and evidence are conclusive; his
position appears to be strongly supported while the environmentalists
are generally discredited as persons overwhelmed by a desire for social
reform and, thus, suffering from an inability to recognize the truth in
scientific studies. Like a good lawyer, Eysenck presents the case for
genetic inequality by emphasizing the favorable evidence and selecting
for critical analysis those investigations most embarrassing to the
opponent. His unrelenting attack includes unflattering characterizations
of the rival, e.g., “extreme environmentalists,” “a determined
egalitarian,” “convinced environmentalists,” etc. And he is a master
in the use of persuasive techniques; his most effective ploy is to
occasionally concede a minor point to the opponent in order to give the
impression that his presentation is truly unbiased.

On the other hand, Eysenck continues to be one of the most effective
translators of the results and implications of scientific psychology
into a form digestible by the interested layman. This recent volume
can be viewed in the tradition of Uses and Abuses, Fact and Fiction,
[...] Your Own IQ, etc. Eysenck’s explanations of heritability, test
validation, research design, regression to the mean, and other technical
topics communicate accurately to the non-specialist the meaning of
what are generally considered to be fairly advanced issues in psychology.
Finally, any tedium which remains is eased by Eysenck’s colorful
writing style and use of eye-catching examples interspersed throughout
the book, e.g., his description of a study of the relationship between
cancer of the womb and circumcision (pp. 27-28), a sample of items
from the m/f scale of sex attitudes (“I would enjoy watching my usual
partner having intercourse with someone else,” p. 30), a correlation
of −.63 between IQ and number of teeth missing (p. 78), his
amusing comparison of the consequences of a “mediocracy” with a
meritocracy (p. 222), etc.

Approximately one-half of the book is concerned with intelligence
(three chapters), with one chapter summarizing personality-related
topics, and the final chapter reserved for a review of the social and
political implications of the conclusions reached. Chapter one (Equality
and Individuality) introduces the reader to Eysenck’s [...]
and includes brief synopses of three previously published books of
similar title (authored by Rousseau, T. H. Huxley, and J. B. S. Haldane);
Eysenck thus puts himself in historical perspective! Chapter
two (What Do IQ Tests Really Measure?) is a defense of psychometric
assessment of intelligence, Spearman's G in particular. Eysenck has a
predilection for drawing parallels between measurement problems in 
psychology and those previously encountered in the physical sci- 
ences, e.g., the measurement of intelligence is compared to the devel- 
opment of the thermometer; Spearman is compared with Dalton, the 
founder of modern atomic chemistry; the physical concept of “hard- 
ness" is compared to the psychological construct of intelligence, etc. 
Intelligence is demonstrated to correlate with "success in life" (in- 
come, occupational prestige), the evidence for creativity as an inde- 
pendent construct is found deficient, and the research on evoked poten- 
tials suggests a physiological basis for intelligence, all leading to the 
conclusion that the search for alternatives to G intelligence has failed. 
The third chapter (Intelligence and Heredity) summarizes the evi- 
dence regarding the relative importance of hereditary and environmen- 
tal influences in accounting for variation in intellectual functioning. 
The logic of twin studies and other methodologies are explained. 
Eysenck reviews studies of identical twins raised separately, fraternal 
twins, relatives, inbred individuals, orphanage children, adopted chil- 
dren, retardates, and environmental factors. The Milwaukee Project 
and Ellis Page's critique are summarized. The chapter concludes with 
consideration of two alternative explanations to the genetic hypothesis
(malnutrition during infancy and social disadvantagement): studies
of Dutch children exposed to famine and “deprived” Eskimos are
reviewed to rebut these proposed explanations. Chapter four (Intelligence
and Social Class) continues the devastation of environmental
explanations of individual variation, e.g., genetic forces guarantee a
redistribution of intelligence (and, thus, opportunity) over generations,
while environmental factors tend to perpetuate class dis[...];
the relationship between intelligence and social class for
[...] children is consistent with the genetic hypothesis; education
has little impact on pupil performance relative to intelligence
(“Education is second only to psychoanalysis in making claims for
untested and untried methods . . . (p. 151)”); etc.

Chapter five ([...] Crime) presents an
[...] heredity in the causation of criminal behavior,
mental illness, and [...] in normal personality functioning.
[...] Eysenck reaches an uncharacteristically
moderate conclusion: heredity and environment are about equally
important determiners of criminal conduct. He even becomes temporarily
humble: “I am not competent to argue . . . (p. 173)” and “. . . but
[...].” Reviews of
[...] and schizophrenia provide
support for a [...] hereditary component. A few studies
[...] the P, E and N [...] of the EPI lead Eysenck to the conclusion that
“[...] accounts for not less than 50% of the total variance, and
may account for as much as 70% (p. 201)." Eysenck enlivens the
chapter by attacking Laing's family oriented theory of schizophrenia 
and taking on the Women's Movement by arguing that psychological 
differences between the sexes are biologically based. 

The last chapter (Social Consequences) considers various “implications
of the facts surveyed in the previous chapters." The chapter
begins with an analysis of Herrnstein's well-known syllogism and its 
corollaries: some are “undoubtedly true" while Eysenck's evaluation 
of others indicates that “Herrnstein is definitely beginning to run off
the rails in his predictions (p. 217)." (Herrnstein may have under- 
estimated the magnitude of regression effects.) Several examples are 
given which illustrate the potential dangers in disregarding the facts of 
biological inequality, e.g., the elimination of IQ tests in selecting chil- 
dren for higher education in Britain and the enforcement of affirma- 
tive action programs in the U. S. will result in lower quality of educa- 
tion for all. Eysenck's basic premise is that psychology can contribute 
to the improvement of the human condition only if inequality is 
recognized and differential treatments are viewed as fair for all; Jen- 
sen's level I and II abilities distinction and Sarason's modeling re- 
search with delinquents are used as illustrations. A potpourri of topics 
comprise the remainder of the chapter, e.g., diabetes, psychosis, and 
asthma are examples of genetically caused defects which are amenable 
to preventative treatments; Atkinson's CAI models illustrate the quan- 
tification necessary to objectify political decision-making; vocational 
satisfaction is viewed in terms of temperamental suitability for the job;
etc.

At several points in the book Eysenck reassures the reader that he is 
not an extreme hereditarian and that he is only trying to establish a 
balance between the relative importance of hereditary and environ- 
mental causes of behavior, e.g., “We must learn to recognize the 
importance of interaction in regard to all aspects of behavior; it clearly 
will not do to slight either the importance of heredity or that of 
environment (p. 242).” On two separate occasions he makes the critical
point that heritability estimates currently available are based on
naturally occurring variation in the environment and, thus, cannot be 
generalized to behavioral treatments or programs which extend be- 
yond the normal limits. 

Nevertheless, since the book was aimed at the layman, who prob- 
ably is not familiar with Eysenck’s inclination to engage in controversy 
and his talent for presenting an impressive case for his point of view, a 
final chapter or rejoinder written by “a determined egalitarian” would
have done much to promote the balance that Eysenck hopes to
achieve.


BRIAN BOLTON 
University of Arkansas 


1044 EDUCATIONAL AND PSYCHOLOGICAL MEASUREMENT 


Gene V Glass, Victor L. Willson, and John M. Gottman. Design and 
Analysis of Time-Series Experiments. Boulder, Colorado: Colo- 
rado Associated University Press, 1975. Pp. xi + 241. $10 


For a good many years the field of design and analysis of behavioral 
science experiments has changed primarily by accretion. That is, there 
has been more and more development of what is currently referred to 
as the "randomized comparative experiment." One of the first breaks 
with what has become a major tradition was Campbell and Stanley's 
Experimental and Quasi-Experimental Designs for Research (1966), 
which examined several designs outside of the mainstream framework, 
in particular, the “interrupted time-series experiment.” Research utilizing
these procedures appeared slowly through the 1960's, but of late
has begun to surface more and more frequently. Glass, Willson and 
Gottman’s book deals solely with time-series experiments and marks a 
major change in our current experimental design and statistics tradition.
It will, without doubt, be looked back upon in future years as a
watershed publication. Prophecy is always dangerous, but several fac- 
tors make it a safe bet that within a decade one-third to one-half of the 
material currently taught in graduate experimental design and statis- 
tics courses to students in the behavioral sciences will have been 
replaced by material similar to that in this book. Some of the reasons 
for this change are obvious, others are more subtle. 

In the first place, the very extent of the development of randomized 
comparative experiments has focused attention on the fact that they 
are best at appraising the results of treatments inserted into an experi- 
mental medium at a particular point in time and manifesting results 
immediately thereafter. But, in social systems as well as in the lives of 
individuals, one is more likely to see patterns and gradations of experi- 
mental effects rather than simple, sharply-delimited expressions of 
them. Time-series analysis allows a wide-angle view of experimental 
effects. Second, single subject designs in the operant conditioning field
[...] a number of areas utilizing behavior modification techniques
[...] without an explicit design and analytic framework.
Time-series analysis provides one. Third, single subject designs
[...] more popular in all areas
[...] work in education because of
increasingly greater difficulties in obtaining sizeable, multiple groups
of subjects and because [...] unwillingness of school administrators
to go along [...] classroom groups to be
evaluated for the effects of different treatments. Time-series analysis
[...] framework for the analysis of such studies.
Finally, a [...] of the dozen most popular “statistics
and design” books shows [...] none of them [...] appreciable
amount of material [...] is related to basic design considerations. They
[...]
factorial arrangements of treatments, but the basic ideas of experimental
design receive almost no attention. This frustrates teachers and
students alike. Time-series analysis goes off on such a disparate tan- 
gent to the traditional approaches that a great deal of time must be 
devoted to design considerations. This has great appeal to graduate 
students and makes the statistical and analytical procedures that are 
included seem much more appealing. For all of these reasons, the field 
of time-series experiments is one of the waves of the future. 

A time-series is basically a set of observations on some dependent 
variable taken in sequence, each measurement separated from the 
other by a (generally) fixed amount of time. In the simplest time-series 
experiment, one repeatedly observes a subject or a group with respect 
to some dependent variable. After a number of observations, a treat- 
ment of some sort is imposed, and observations are then made through 
a number of subsequent periods to determine whether the level of the
dependent variable remains the same or changes in some systematic 
fashion. The major analytic problem is that there are, of course, 
fluctuations in the level of the dependent variable both before and 
after the intervention; and one has to determine whether changes
subsequent to the intervention are different from what could be expected
on the basis of the random shifts inherent in the series.
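
The simplest design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the series length, the size of the level shift, and above all the independent Gaussian noise are hypothetical choices made here, not anything taken from Glass et al. Real series usually have correlated fluctuations, which is why a naive comparison of pre- and post-intervention means tends to overstate its own precision.

```python
import random
import statistics


def simulate_interrupted_series(n_pre, n_post, level, shift, noise_sd, seed=0):
    """Return (pre, post) observations around an abrupt change in level.

    Assumption: fluctuations are independent Gaussian noise.
    """
    rng = random.Random(seed)
    pre = [level + rng.gauss(0.0, noise_sd) for _ in range(n_pre)]
    post = [level + shift + rng.gauss(0.0, noise_sd) for _ in range(n_post)]
    return pre, post


def naive_shift_estimate(pre, post):
    """Difference of post- and pre-intervention means with a naive standard error."""
    diff = statistics.mean(post) - statistics.mean(pre)
    se = (statistics.variance(pre) / len(pre)
          + statistics.variance(post) / len(post)) ** 0.5
    return diff, se


pre, post = simulate_interrupted_series(
    n_pre=50, n_post=50, level=20.0, shift=5.0, noise_sd=1.0
)
diff, se = naive_shift_estimate(pre, post)
print(f"estimated shift = {diff:.2f}, naive standard error = {se:.2f}")
```

The naive standard error is trustworthy only when successive observations are uncorrelated; when they are not, the change has to be judged against a model of the series itself.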

The primary design considerations in time-series experiments are
the same as in any research problem, but there are several ideas which
receive particular emphasis. They revolve around whether the same
group or different groups are observed at different points in time, 
whether more than one group is observed simultaneously, whether one 
intervention or more than one is used with each group, and whether 
the effects of the two or more interventions have been observed in 
reverse order to enable assessment of interaction or sequence effects.
Interaction effects are viewed from a different perspective than in
randomized comparative experiments: the effects on the dependent
variable become more complex since one does not examine them at a
single point in time but at many points in time. Rate of change,
magnitude of change, or even lack of change will each be interpreted 
differently depending upon the timing of the intervention, the duration 
of the intervention and temporary occurrences in the experimental 
field. Glass et al. present an extremely thorough and, on the whole, 
quite readable discussion and evaluation of various time-series designs 
and interpretation problems. The level of discussion is kept admirably 
even, and should be intelligible to any interested student. A great 
many examples from the literature of a variety of fields are examined
in the context of each design question. Another strength of the dis- 
cussion is that it is critical: the multiple baseline design introduced by 
Risley and Baer in 1969 which is so widely used in the field of operant 
conditioning is shown to depend upon a logic that may be asymmetrical
or even contradictory. The authors are not shy about discussing
and naming the many factors which may work to invalidate time-series
experiments. They discuss in detail such factors as historical invalidity,
reactive intervention, multiple intervention interference, instrumentation
problems and the like.

The mathematical approach utilized by Glass et al. derives from a 
development of spectral analysis due primarily to Box and Jenkins 
(1970), and involves linear models collectively termed “auto-regressive
integrated moving average models," or "ARIMAs." The simplest 
cases are an auto-regressive model and a moving averages model. In 
simple terms, an auto-regressive model is one in which succeeding time 
states depend upon preceding ones. Actually, of course, the depend- 
ency may be over a time difference of several observational periods. If 
the relationship is from one observational period to the next, this is 
called a lag one process, if from the first observational period to the 
third, it is called a lag two process, and so forth. The moving averages 
model essentially assumes that the state of a process at a given time is 
the average of a series of random shocks delivered at preceding periods.
Thus, while there is no significant correlation between one time
period and the next, the present state of the series is an average of what 
has gone before. 
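
The two simplest cases can be written out in the notation usually associated with Box and Jenkins. The symbols are assumptions introduced here for illustration and do not appear in the review: z_t is the deviation of the series from its overall level at time t, a_t is a random shock, and the phi and theta terms are weights to be estimated.

```latex
\begin{align}
  % Auto-regressive model of order p: the current state depends on the p preceding states
  z_t &= \phi_1 z_{t-1} + \phi_2 z_{t-2} + \dots + \phi_p z_{t-p} + a_t \\
  % Moving averages model of order q: the current state is a weighted sum of recent shocks
  z_t &= a_t - \theta_1 a_{t-1} - \theta_2 a_{t-2} - \dots - \theta_q a_{t-q}
\end{align}
```

In this notation the lag one auto-regressive process keeps only the first weight, giving z_t = \phi_1 z_{t-1} + a_t.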

A problem in analyzing time-series is the fact that some are “non-stationary,”
that is, they tend to wander temporarily from one overall
level to another and back again in stochastic fashion. Taking differences
between successive time points (lagged by one, two, three or
more time periods) will generally serve to convert a non-stationary
process to a stationary one which may then be described in terms of an
auto-regressive or moving averages model (or a combination of the
two). [...] introduced by Box and Jenkins and capitalized
[...] by Glass et al. has the conceptual simplicity and beauty
of [...]
whether it is auto-regressive, requires differencing, and/or has moving
[...]
calculus of operators for classifying and working with ARIMA models
[...] simple and straightforward for anyone with
[...] models. The computations involved are
well illustrated, and utilize both good real
[...] excellent simulated data sets.

[...] of the experimentalist, the real problem in
[...] that of determining whether or not an intervention
[...] over and above what might be accounted
[...], estimating its magnitude. The procedures
[...] require several steps. First of all, one identi[...]
and post-intervention). Parameters are estimated
[...] series determined. The intervention effect is
[...] maximum likelihood procedure, which
[...] minimizing the error sum of squares. This
[...] of a set of linear equations by varying values
[...] parameters until a minimum is found. Again, the
[...] and computations are well illustrated. Moreover, the authors
have developed a set of computer programs to facilitate computations
which should bring use of the model quickly within grasp of
many prospective users.
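
The idea of varying parameter values until the error sum of squares reaches a minimum can be illustrated with a small Python sketch. It is not a reproduction of the authors' computer programs. It assumes, purely for illustration, an abrupt change in level and noise that follows a lag one auto-regressive process; the function names, the grid of candidate values, and the simulated data are all hypothetical. For each candidate auto-regressive weight the series is transformed so that the remaining errors are uncorrelated, the level and the intervention effect are obtained from two linear equations, and the candidate with the smallest error sum of squares is kept.

```python
import random


def simulate_series(n_pre, n_post, level, shift, phi, seed=1):
    """Simulate a level-shift series whose noise is lag one auto-regressive."""
    rng = random.Random(seed)
    series, noise = [], 0.0
    for t in range(n_pre + n_post):
        noise = phi * noise + rng.gauss(0.0, 1.0)
        step = 1.0 if t >= n_pre else 0.0
        series.append(level + shift * step + noise)
    return series


def fit_given_phi(series, n_pre, phi):
    """For a fixed phi, solve the two normal equations for level and shift.

    Returns (error sum of squares, level, shift).
    """
    rows = []
    for t in range(1, len(series)):
        step_now = 1.0 if t >= n_pre else 0.0
        step_prev = 1.0 if t - 1 >= n_pre else 0.0
        y = series[t] - phi * series[t - 1]      # transformed observation
        rows.append((y, 1.0 - phi, step_now - phi * step_prev))
    s11 = sum(x1 * x1 for _, x1, _ in rows)
    s12 = sum(x1 * x2 for _, x1, x2 in rows)
    s22 = sum(x2 * x2 for _, _, x2 in rows)
    s1y = sum(x1 * y for y, x1, _ in rows)
    s2y = sum(x2 * y for y, _, x2 in rows)
    det = s11 * s22 - s12 * s12
    level = (s1y * s22 - s2y * s12) / det
    shift = (s2y * s11 - s1y * s12) / det
    sse = sum((y - level * x1 - shift * x2) ** 2 for y, x1, x2 in rows)
    return sse, level, shift


def estimate_intervention(series, n_pre):
    """Grid search over phi; keep the candidate with the smallest error sum of squares."""
    best = None
    for k in range(-95, 96, 5):
        phi = k / 100.0
        sse, level, shift = fit_given_phi(series, n_pre, phi)
        if best is None or sse < best[0]:
            best = (sse, phi, level, shift)
    return best


series = simulate_series(n_pre=100, n_post=100, level=10.0, shift=3.0, phi=0.5)
sse, phi_hat, level_hat, shift_hat = estimate_intervention(series, n_pre=100)
print(f"phi = {phi_hat:.2f}, level = {level_hat:.2f}, shift = {shift_hat:.2f}")
```

A grid search is used only because it is easy to follow; any one-dimensional minimizer would serve, and richer noise models would need more parameters in the search.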

The next to the last chapter of the book deals with analysis of 
related time-series through concomitant variation. That is, if one series 
is related to another, one may often discover useful information about 
one from known characteristics of the other. This procedure is in 
many ways very similar to co-variance analysis and will thus have a 
familiar ring for many behavioral and social scientists. 

The last chapter of the book deals with several specialized topics 
including deterministic drift, changes in variance, changes in the 
model, and cyclic variation. Although they are short, these sections are 
well written and useful. 

There are appendices which deal with spectral analysis of time-series 
and with the linear model and least squares theory. Both of these 
appendices are too compressed to be of use to anyone who does not 
have the fundamentals of those techniques reasonably well in mind to 
begin with. Another appendix contains listings of the data sets used to 
illustrate particular problems and should, together with the programs 
which the authors have written, be most useful. 

The formal aspects of the book are rather unexceptional except that 
the manuscript has been typed and photographed so that the right 
margin is not justified. A good many of the "instant books" produced 
in this fashion are full of typos and are poorly edited, but the Glass et 
al. volume does not belong in this category. It is well constructed and
put together, there are few typos, the only bad one being in the third
paragraph on page 48 where “unobserved” is substituted for “observed.”
One would expect the symbols and formulae to be exceptionally
poorly set and hard to read, but they come through well. As
stated above, the examples are excellent, and they are completely and 
accurately cited in an excellent reference section. 

Glass et al. set out to show where new ground has been broken and 
to direct others not only in where to plow their own furrows but in
how to keep them straight. On the whole, they have done an excellent 
job, and it is likely that this book will come to be viewed as a true 
landmark. It differs from many landmarks in that, although a first 
effort, it is exceedingly well organized and well constructed and should 
stand the test of time exceedingly well.
John L. Hayman, Jr. and Rodney N. Napier. Evaluation in the
Schools: A Human Process for Renewal. Monterey, Calif.: Brooks/Cole,
1975. Pp. xi + 143. $3.95 (paperback).


This short, concise book may be read with profit by anyone concerned
with evaluation in the schools. It is not a comprehensive treatise
on evaluation, since many of the technical details about research
design, statistical analysis, and instruments employed in evaluation 
studies are not presented. Matters such as how to survey, interview, 
select tests, and the like are not discussed at all. Rather, what the 
authors stress are procedural matters and the human factors involved 
in evaluation, including process and subjective evaluation. Also em- 
phasized are the facts that evaluation can be helpful or harmful, and it 
is the responsibility of the evaluator to determine which it shall be. 

The authors possess a penchant for outlining and schematizing, a
talent that will facilitate the use of the book as a sourcebook and a text 
but which results in somewhat slow reading. As stated in the Preface, 
this is not a cookbook; it is an attempt to prepare the would-be 
evaluator through means of practical examples. Such examples are less
plentiful in the first four chapters, which are more like expanded 
outlines of principles and procedures, than in the last four. Furthermore,
the authors maintain that the book is appropriate for undergraduates
in educational psychology and related courses, as well as
educational practitioners. Certainly many undergraduates in educa- 
tional fields could study the book with benefit, but most will probably 
find it fairly difficult reading. This reviewer considers the book's 
primary value to be in training educational evaluators—when supple- 
mented extensively by books on educational measurement, adminis- 
tration, and psychology. 

The eight chapters comprising the book are entitled: 


1. A New View: Evaluation as Integral to the Educational Process
2. Planning for Evaluation
3. Goals and Objectives in School Evaluation
4. [...] in Program Development
5. Process Evaluation: Program Development and Organizational Applications
6. [...] of Process Evaluation for the Classroom Teacher
7. Accountability and Evaluation in the Schools
8. Synopsis: A Brief Guide to the Evaluation Process


Every chapter is well written, if succinct, and contains a concluding
summary.

Throughout the chapters the point is repeated that feedback of the
results of evaluation from evaluators to users is essential if evaluation
is to be more than a futile exercise in wasting time. The first four
chapters are highly structured, describing methods of planning, effecting,
and utilizing evaluation, as well as input, process, and outcome
variables. Beginning with Chapter 5 the orientation of the book becomes
more psychological, or rather socio-psychological. The descriptions
of group dynamics and process evaluation in Chapters 5 and 6
are especially good. Chapter 7 on accountability deals more with
social psychology and management than with technical issues concerning
the measurement of change. There is a good discussion of
Dyer's multiple regression approach to accountability and a recognition
of its shortcomings. The weaknesses of the multiple regression
approach, and the necessity of complementing it with a "management-by-objectives"
approach, are discussed. The latter approach, of
course, also has weaknesses: getting participants to state objectives
clearly; extensive administrative time required; the fact that the approach
sometimes results in psychotherapeutic soul-searching more
than realistic attempts to state and attain objectives.

In Chapter 8, a comprehensive summary of the previous seven 
chapters, the authors continue to make important points concerning 
such matters as the role of environment in evaluating affective objec- 
tives; the value of judgmental evaluation information; the use of 
teacher-researchers in planning, effecting, and evaluating the results of 
research in the schools; and the necessity of creating understanding 
and rapport on the part of the teachers, administrators, and others 
affected by evaluation. 

LEWIS R. AIKEN, JR.


John C. Loehlin, Gardner Lindzey, and J. N. Spuhler. Race Differences
in Intelligence. San Francisco: Freeman, 1975. Pp. xii + 380.
$12.00 and $5.95 (paperback).


This superb volume should be the final word on origins of race 
differences in intelligence. Not because it provides any ultimate an- 
swers, but for exactly the opposite reason. All evidence currently 
available has not really brought us much closer to any definitive 
conclusions. In fact, this comprehensive statement of our ignorance 
will probably serve to trigger an avalanche of investigations. The 
authors virtually predestine this event by outlining 10 “promising 
areas of research” in the last chapter.

Race Differences was prepared under the auspices of the SSRC's 
Committee on Biological Bases of Social Behavior. The authors com- 
pleted their first year's work on the book in the intellectually stimu- 
lating environment of the Center for Advanced Study in the Behav- 
ioral Sciences. They were assisted by an advisory board consisting of 
20 distinguished scientists, educators, and public figures (from Anastasi
to Wolfle). A draft of the manuscript was reviewed by advisory
board members, by six minority-group consultants, and—in various 
degrees of thoroughness—by 50 prominent academicians in the biolog- 
ical and social sciences (including Eysenck, Jensen, and Shockley). 

The authors’ stated goal was “to provide a sober, balanced, and
scholarly examination of the evidence” regarding the relative contributions
of genetic and environmental variations in explaining racial-ethnic
IQ differences and to discuss the social and political implications
of the results. Their purpose is impressively achieved in this tour
de force of scholarship. The book is extremely well organized and
beautifully written. Every conceivable source of relevant data is reviewed.
The evaluations of the investigations are insightful and
thorough. Summaries and discussions are strategically placed in the
text to have maximum relevance and impact. Technical issues are
handled in 70 pages of appendices which are marvels of clarity and
comprehensiveness.

And while the authors’ conclusions are almost always multiply- 
qualified and hedged, they point out repeatedly that the inconsistency 
and poor quality of the evidence necessitates their extreme tentativeness,
e.g., regarding nutrition and black-white IQ differences:
"The best evidence needed to answer these questions is lacking,
and the relevant, usable information is scanty, and of unknown or 
dubious reliability (p. 225)." 

Race Differences consists of 10 chapters which are organized into 3
major sections: Issues and Concepts (four chapters), The Empirical 
Evidence (four chapters), and Conclusions and Implications. 

Chapter one (The Problem and Its Context) traces the roots of “the
controversy” to Darwin (evolution), Galton (inheritance), and Mendel
(genetics). Intelligence tests are characterized as “one of the most
[...] accomplishments of the social sciences (p. 5).” The key role of
Jensen's 1969 HER article, the involvement of subsequent protagonists
(Herrnstein, Eysenck, and Shockley), and the [...] are reviewed. The
chapter concludes with eleven [...] “persistent misconceptions” which
serve to introduce terminology and preview later [...].

Chapter two (Race as a Biological Concept) presents a brief overview
[...], illustrates racial variation in color and size in house sparrows,
and reviews race formation in prehistorical man. The authors rephrase
the central question of the [...] that determine intellectual capability
differences [...] the major races? — and evaluate the four

[...]on, (2) the choice between G and a multi-ability conceptualization
is a matter of convenience and purpose, and (3) it is not innate, but
rather it is developed. Reasoning by analogy using the trait of stature
leads to the suggestion that intelligence may have been differentially
selected in different environments in the past. Finally, three studies of
test-bias reversal (rural children, Indian children, and blacks) suggest
that IQ tests may be biased against some groups. Not all studies
support this conclusion, however.

Chapter four (Heritability) is a rather technical introduction to a
number of complex topics, including developmental genetics, covariance
and interaction, genetic variance components, distinctions
among heritability coefficients, comparisons of models of inheritance,
heterosis, etc. Among the major conclusions are the following: (1)
Within-population estimates of IQ heritability may, but need not, have
value for interpreting between-population (racial or cultural groups)
differences in IQ-test performance. This point was the basis of Kagan's
criticism of Jensen's HER conclusion concerning possible racial differences.
(2) The broad heritability of IQ in Caucasian populations lies in
the range from .60 to .80. Two prominent exceptions to this conclusion
are the estimates calculated by Jencks and Kamin. While Jencks's
estimate of .45 poses no serious problem, Kamin's assertion of zero
heritability, if supportable, would severely damage the authors’ argument.
Therefore, they thoroughly review Kamin’s evidence, analyses,
and arguments and convincingly demonstrate that his conclusions are
based on a biased, selective review that suffers from several logical and
statistical errors.

Chapter five (Genetic Designs) presents a detailed review and evaluation
of (1) twin and sibling studies which provide comparable heritability
estimates for black and white samples, (2) interracial adoption
studies, (3) studies of half-siblings, and (4) studies of subjects of mixed
racial backgrounds, e.g., correlation of skin color and intelligence,
correlation of blood group genes and IQ in blacks, offspring of black
soldiers in Germany in WWII, etc. Although every one of the dozen
studies reviewed suffers from one or more defects—sampling limitations,
incomplete information, and various other methodological
flaws—the authors conclude that (1) IQ is substantially heritable in the
US black population, and (2) the existing genetic evidence could be
used to support either environmentalist or hereditarian viewpoints.

Chapter six (Temporal Changes in IQ) consists of a review and
evaluation of studies of population trends in IQ, developmental studies
of IQ change, and the effects of compensatory education programs.
The authors draw on a wide variety of data sources, e.g., Binet and
WISC standardization data, the Coleman report, US Army induction
test data, etc., to reach three conclusions: (1) population increases in
IQ are associated with educational improvement, (2) black-white IQ
differences emerge by age three or four, generally remain fairly
stable throughout the school years, and may increase during adulthood,
and (3) the massive research effort in compensatory education
has not produced any breakthroughs. The authors summarize by stating
that while environmental factors do influence IQ development, this
does not rule out substantial genetic determination of individual or
group differences.

Chapter six is notable for two reasons. It is the first to include data 
(several tables) which document the black-white IQ difference of one 
standard deviation referred to initially in chapter two. The fact that 
there is no chapter devoted to the establishment of the accuracy and 
meaning of this performance difference suggests that the authors have 
considerable confidence in the available evidence and measuring in- 
struments (but see their major conclusion in chapter ten!). The second 
point concerns the generally positive evaluation of Jensen's position. 
In fact, some will view Race Differences as a vindication of Jensen. On 
several occasions the authors imply that Jensen has been misread or 
misunderstood. In an interesting (if not intentional) phrasing, they 
conclude that the failure of compensatory education to ameliorate 
black children's scholastic deficits “made it not unreasonable for Jensen 
(1969) and others to reopen the question of a possible genetic com- 
ponent (p. 162).” The italics in the quote are added to emphasize the 
similarity to Jensen’s original statement (1969, p. 82). Finally, Jensen 
is cited 22 times in the text (Burt, Nichols, and Shuey have nine 


[...] continues the review of studies which bear on the main question
addressed in the book. The most consistent evidence so far concerns
[...] racial-ethnic groups: the underlying
dimensions are the same. The investigations of ability profiles of various
racial-ethnic groups (which might be hypothesized to be the effect
[...] to different environments) are [...]
there is evidence of differences in [...]
blacks do better on verbal than non-[...]
evidence for a specific perceptual deficit [...]
with an excellent discussion of the [...]
phenomenon and a good summary [...]

[...] authors calculate that the cumulative
effect may account for “a few points” [...]

reviewed in chapters five through eight. The distribution by topics is:
quality of evidence (2), heritability of IQ (3), racial mixture (5), com- 
parisons of socioeconomic and racial groups (7), and nutrition (3). 

Chapter ten (Implications and Conclusions) begins with a statement 
of the authors’ conclusion regarding the sources of observed IQ differ- 
ences among racial-ethnic groups: the differences are due to (1) psycho- 
metric deficiencies in test instruments, (2) differences in environmental 
conditions, and (3) genetic differences. The relative weight assigned to 
each component is simply a matter of judgment! The authors are fully 
aware of the indeterminacy of their major conclusion and explain
that the evidence does not support any stronger statement. A few 
pages later they do make a slightly stronger assertion: “We consider it 
quite likely that some genes affecting some aspects of intellectual per- 
formance differ appreciably in frequency between U.S. racial-ethnic 
groups . . . (p. 240).”

So what do we really know about the origins of racial differences in 
IQ? Surely, there is something wrong when the strongest conclusion 
that is scientifically warranted is virtually meaningless. The authors 
have convinced me that the question is simply unanswer- 
able at the present time. Should research on racial differences in 
intelligence continue? There are too many socially important topics to 
be researched—and no conceivable conclusion concerning the source 
of black-white IQ differences is going to alter the fact that it is the 
individual, not any particular population subgroup, about whom decisions
are made and for whom opportunity exists in a democratic
society. The authors continually stress this point and the well known 
fact that much more IQ variation occurs within racial-ethnic groups 
than between them; both points would seemingly detract from the
importance of research addressing the primary question of the book. 

The remainder of chapter ten consists of two excellent sections: (1) a 
discussion of the social and political implications of possible genetic 
differences in intelligence, and (2) an assessment of the social context 
of "sensitive" research, followed by brief descriptions of ten promising 
areas of research on racial-ethnic differences. The authors affirm their 
belief in the importance of basic research in general, and judge (some) 
research on racial differences to be of fairly high scientific and social 
priority. Furthermore, they support Shockley's (or anyone else's) right
to investigate possible dysgenic trends within racial groups, but accord
it low priority.

As a reviewer of Race Differences I feel an obligation to emphasize
the distinction between my evaluation of the authors’ effort and product—probably
the most scholarly volume I have ever read—and my
reaction to their conclusions—disappointment. Of course, they cannot
be held responsible for the inconsistency and poor quality of the
evidence. Nor should the many investigators whose research was reviewed
be blamed—it is an extremely difficult question to study. My
personal feelings about research on racial differences are perfectly
stated by Loehlin, Lindzey, and Spuhler: “When one considers the
hundreds of inconclusive comparisons of black and white IQs that 
have appeared in published studies in the scientific literature in the last 
several decades, the thousands of pages of print devoted to discussion 
of race and IQ, and the massive commitments of human and financial 
resources based on assumptions about matters to which these studies 
are directed, one can hardly help but feel some serious misgivings 
about the responsiveness and responsibility of the system of rewards 
and support that determines which research gets done in the social 
sciences (p. 256)." Amen. 


BRIAN BOLTON
University of Arkansas 


Elijah P. Lovejoy. Statistics for Math Haters. New York: Harper and
Row, 1975. Pp. x + 251. $8.95 paperback.


One might paraphrase the author to indicate more fully his intent
and approach as “The Statistics of Psychological Experimentation for
Undergraduates Initially Aversive to Mathematics.” This would be
true as indicated later in this review, but it should not be taken to
imply limitation of the use of this publication to individuals in such
circumstances.

This textbook/manual benefits from its explicit focus on those being
introduced to the tools of statistical logic in an undergraduate major in 
general psychology. We may accept the author's and editor's assur- 




etc. Here it would have supplementary use in the initial stages of
a course in inferential statistics. Those who prefer to teach descriptive
statistics first, as a course worthwhile in itself for teaching means
of summarizing and presenting quantitative information in readily
understandable form, may well continue to reserve inferential statistics
for a second course. Taking that approach and using this publication
as a supplementary workbook in the second course should provide the
added value of an independent source of stimulation about the thinking
processes involved. In these applied fields many of the studies fall
into categories variously called quasi-experimental, ex post facto, or
comparative, rather than rigorously experimental, but the logic
of experimentation is basically the same and the illustrative examples
should be clarifying.

The sequence of topics in Part III, following the earlier treatment of
the sign test, binomial and normal distributions, does unnecessary
violence to systematic relations, by going from the t-test for single
means and correlated means, into correlation, chi-square, and then the
t test for means of uncorrelated samples, concluding with the difference
between two-tailed and one-tailed tests. Putting aside the advantage
a separate course in descriptive statistics provides for organizing
concepts and measures under the percentile and moment systems, a
structure for paralleling treatment of discrete and continuous variables,
respectively, gives a meaningful relation among approaches in
inferential statistics that is lost by purely topical treatment. And even
“math haters” gain something by such structuring.

It should be noted that analysis of variance (and covariance) is
omitted as beyond the scope of the beginning course. This is regrettable
because the logic of experimentation, which dictates that
ANOVA must precede study of mean differences when several means
are available, is lost. Of course, in a statistics sequence for psychology
students who had used this book, these topics and the logic of multivariate
analysis, discriminant analysis, factorial designs, etc. would be
covered in subsequent courses.

The use of flash cards to acquire concepts and formal definitions
should help beginning students, as should also the inclusion of partial
statistical tables in the body of the text. Perhaps they are not essential,
but they are probably helpful. (By way of contrast the minor exercise
on the relative merits of $8 and $5 Scotch is dubious on about every
count.)

In general, the book seems well designed for its avowed purposes and
to have additional value for possible supplementary uses.


WARREN G. FINDLEY 
University of 
Alabama in Birmingham 


Jum C. Nunnally. Introduction to Statistics for Psychology and Education.
New York: McGraw-Hill, 1975. Pp. x + 342. $10.95.


There is a veritable glut of textbooks in certain areas of psychology 
and education today. General psychology and educational psychology 
are two such areas, and introductory statistics is fast becoming an- 
other. Given this state of affairs, it behooves an aspiring author of a 
textbook in any of these areas—whether his interest is pedagogy, 
profit, or both—to look carefully at what is already available and find 
a way of doing it better, if he can. This is what the author of the 
present introductory statistics book has tried to do. He apparently 
feels that there are enough statistics cookbooks, mathematically-orien- 
ted statistics books, and books concerned primarily with statistical 
inference rather than description. Verification of this supposition is 
not difficult to find; one need search no further than McGraw-Hill's 
list of a dozen statistics books.

In a way, it is good to have textbooks of varying genre available; it
gives instructors a choice. Unfortunately, instructors often seem to
select textbooks for themselves rather than their students. And the fact
that an instructor likes a book is obviously no guarantee that his
students will follow suit. Instructors have even been known to choose 
two textbooks for a course, taking their lecture notes from the better 
of the two and assigning the remaining book as the course text! 

The yellow-covered volume that is the subject of this review could 
presumably serve either of these functions. It consists of 13 chapters 
grouped into four parts: Fundamental Concepts, Descriptive Statistics,
Inferential Statistics, and Nonparametric Statistics. I didn't quite
understand the purpose of the illustration on the front cover—a profile
of a human head with an irregular line graph running sagittally
through it, but that is obviously less important than understanding
what follows the cover. What does follow is a highly verbal book
[...] no statistical symbols before Chapter 5. At that point,
[...] symbols begin to fall thick and fast and the coverage becomes
[...]. Measures of central tendency and variation
[...] one chapter, followed in rapid-fire succession by
[...] correlation, t tests, F tests, analysis of variance, and a bit about
[...]. Principles, rather than mathematics, constitute the
[...]. [...] betrays [...] specialization [...],
for example when he derives formulas pertaining to the
[...] coefficient and takes many of his examples from the field of
psychological testing. This is to be expected, since it is difficult not to
dwell on what one knows best. For this reason, statistics books written
by experimental psychologists tend to stress inference and experimental
design. In contrast, those written by specialists in psychological and
educational measurement, such as Nunnally and Guilford (see Guilford
& Fruchter, 1973), give more attention to correlational analysis
and norms.

The author of the present textbook had three collaborators—Robert
L. Durham, L. Charles Lemond, and William H. Wilson. Nevertheless,
several typos, errors and inconsistencies sneaked through. For
example, formula 5-6 on p. 104 is incomplete, and the definition of
percentile is patently wrong, viz. “the percentage of persons who fall
below a particular score” (p. 119). The correct definition of the pth
percentile is “. . . that value on the scale of measurement below which
p percent of the cases in the distribution fall, . . . whereas its corresponding
percentage is known as the percentile rank” (McCall, 1975,
p. 64). An inconsistency occurs when, in the preface, the author disparages
the inclusion of “museum pieces” in statistics texts, and then
proceeds in Chapter 5 to describe an archaic statistic known as the
average deviation. Another inconsistency occurs in defining s² in two
ways—with N and N−1 in the denominator. This double usage will
probably create confusion for the introductory student who reaches
the section in Chapter 9 on computing the standard error of the mean.
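
The distinction between a percentile and a percentile rank, and the effect of the two denominators, can be illustrated with a short Python sketch. The scores are invented for illustration, and the helper name percentile_rank is an assumption of this example rather than anything taken from the textbook under review.

```python
import math
import statistics

# Hypothetical scores, invented for illustration only.
scores = [2, 4, 4, 5, 7, 8]
n = len(scores)

def percentile_rank(data, score):
    """Percentage of cases falling below `score` (a percentile RANK)."""
    below = sum(1 for x in data if x < score)
    return 100.0 * below / len(data)

# A percentile is a VALUE on the score scale; the 50th percentile is the median.
p50 = statistics.median(scores)

# Variance with N in the denominator (descriptive form).
var_n = statistics.pvariance(scores)
# Variance with N - 1 in the denominator (unbiased sample estimate).
var_n_minus_1 = statistics.variance(scores)

# Standard error of the mean, built on the N - 1 form.
sem = math.sqrt(var_n_minus_1 / n)

print("50th percentile (a score):", p50)
print("percentile rank of that score:", percentile_rank(scores, p50))
print("variance with N:", var_n, " with N - 1:", var_n_minus_1)
print("standard error of the mean:", sem)
```

The two variance forms differ only by the factor N/(N − 1), so a student who carries the N form into the standard error formula will understate the error slightly in small samples.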


In the preface the author stated that only topics that are widely 
encountered in psychology and education are discussed in the book. 
To be certain of this, one would need to tabulate the usage of certain 
statistical methods in articles appearing in professional journals in 
these fields. If this were done, I seriously doubt whether the average 
deviation, or even the mode, would qualify for inclusion in the class of 
most frequently used statistics. On the other hand, the binomial proba- 
bility formula, the test for difference between proportions, partial 
correlation, test of significance of regression coefficients, the test that 
two correlation coefficients are equal, and even analysis of variance by 
ranks are methods used in many published investigations but omitted 
in the present textbook. Finally, although this book purports to em- 
phasize the understanding of principles and concepts rather than 
mathematical skills, I missed any reference to the important topics of
the central limit theorem, Type I and Type II errors, the power of a
statistical test, the consistency and bias of estimators, or even the
descriptive statistic of kurtosis.

In spite of these shortcomings, it can be concluded that, to a certain
extent, the author and his collaborators have achieved their objectives—to
write an interesting book emphasizing understanding rather
than skill, including methods used most widely in psychology and
education, dealing with description more than inference, and trying
not to overwhelm the student with mathematics and computation.
Whether or not these objectives are the best ones to follow in writing
an introductory statistics text is debatable. For example, interest and
understanding are often obtained at the expense of comprehensive
survey and in-depth coverage. It can also be argued that since the first
course in statistics is the only one that most students will take and even
those who go on to advanced courses benefit from a thorough survey
in the first course, the beginning statistics textbook should be comprehensive.
It should not only give a thorough coverage of fundamentals
but also serve as a sourcebook and reference manual long after
90% of the course experiences have been relegated to the same netherworld
as other college educational experiences.

Authors of statistics textbooks sometimes try too hard to minimize 
"symbol shock" in the mathophobes who are forced to take statistics 
in order to be certified to “work with people." Although social science 
and humanities majors frequently suffer from an inability to manipu- 
late numbers and algebraic symbols, a more serious deficit may exist in 
their ability to do precise abstract thinking of the problem-solving 
type. This is especially evident when students do fairly well with the 
descriptive aspects of statistics but begin to have headaches when the 
topic of statistical inference is introduced. It is difficult to prevent or 
cure such headaches; even when a predominantly verbal text is used, 
the concepts still require logical, reflective thinking. 

[...]lected later reduced to a "cohort" of 10,317 cases for which Henmon-Nelson
measures of mental ability were available.

Path coefficient or path analysis diagrams were drawn to illustrate 
the relationships between the variables listed above. The numerical 
entries in these diagrams are path coefficients or regression coefficients 
in standard score form. The squares of these coefficients are 
coefficients of determination. These may be used to indicate the proportions
of the variance in an effect variable to be attributed to various causal ones.
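
A minimal sketch of this arithmetic for the two-cause case follows. The correlations are hypothetical values chosen for illustration, not figures from the study under review, and the function names are assumptions of this example. It uses the standard formulas for standardized regression weights with two predictors, and it shows that the squared coefficients sum to the explained variance only when the causal variables are uncorrelated.

```python
import math

def path_coefficients(r_y1, r_y2, r_12):
    """Standardized regression weights of y on x1 and x2, from correlations."""
    denom = 1.0 - r_12 ** 2
    p1 = (r_y1 - r_y2 * r_12) / denom
    p2 = (r_y2 - r_y1 * r_12) / denom
    return p1, p2

def explained_variance(r_y1, r_y2, r_12):
    """Squared multiple correlation of y with x1 and x2."""
    p1, p2 = path_coefficients(r_y1, r_y2, r_12)
    return p1 * r_y1 + p2 * r_y2

# Hypothetical correlations, for illustration only.
for r_12 in (0.0, 0.3):
    p1, p2 = path_coefficients(0.5, 0.4, r_12)
    total = explained_variance(0.5, 0.4, r_12)
    direct = p1 ** 2 + p2 ** 2          # sum of squared path coefficients
    joint = 2 * p1 * p2 * r_12          # share due to the correlated causes
    print(f"r12={r_12}: direct={direct:.4f} joint={joint:.4f} total={total:.4f}")
```

When the causes are correlated, the squared coefficients account for only the direct contributions; the remainder of the explained variance is the joint term.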

Sewell and Hauser credit the path analysis procedure to Otis Dudley
Duncan without mentioning its history which goes back to its origina- 
tion by Sewall Wright in 1921 and its use since that date by a number 
of persons including this reviewer. 

Sewell, Hauser, and their associates deserve commendation for studying
the possible effects of nonresponse in biasing the data. They demonstrate
that nonresponse had little effect on the general patterns of
the 1957 and 1964 data. “Bias due to nonresponse has been shown not 
to affect materially either the univariate or the multivariate statistics of 
key variables." (p. 42). 

Chapter 7 summarizes the relationships between socioeconomic 
background and mental ability as causes and educational, occupa- 
tional, and financial achievements as effects. There is excellent critical 
discussion of social psychological factors, college effects, and ability-schooling
interactions.


MAX D. ENGELHART


David M. Shoemaker. Principles and Procedures of Multiple Matrix 
Sampling. Cambridge, Massachusetts: Ballinger Publishing Company,
1973. Pp. xviii + 305. $12.50.


This text serves as a reference manual for educational and psychological
researchers and evaluators interested in the theory, development,
and implementation of multiple matrix sampling designs. As
such, it represents an organization and synthesis of the multiple matrix
sampling literature through the end of 1973. According to the author,
“Throughout [the] book an attempt has been made to keep the practitioner
clearly in mind. The emphasis is clearly on the why, when, and
how to use multiple matrix sampling.” With its emphasis on practical
techniques, the book is designed to make multiple matrix sampling
more readily available to those interested in the assessment of group
performance in a wide variety of areas.

The book is divided into two parts. In the first, the author introduces
the multiple matrix sampling model, presents relevant theory,
discusses applications, and considers guidelines for utilization. Specifically,
the first part of the book consists of seven chapters spanning 85
pages of text and covering the following areas: (1) definition, advantages,
limitations, and applications of multiple matrix sampling, (2)
guidelines for the utilization of multiple matrix sampling, (3) computational
formulas, (4) use of computer simulation techniques in designing
multiple matrix sampling studies, (5) hypothesis testing, (6)
applications, and (7) future possibilities.
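
For readers unfamiliar with the technique, the basic idea can be sketched in a few lines of Python. This is a simulation under stated assumptions, not the book's programs or formulas: the function name matrix_sample_mean, the round-robin partition into disjoint subtests and subgroups, and the simple averaging of scaled subtest means are all choices made for this example, and the response data are randomly generated.

```python
import random

def matrix_sample_mean(scores, n_subtests, seed=0):
    """Estimate the mean total test score from an item-examinee sample.

    `scores` is a complete examinee-by-item 0/1 matrix (list of lists). It is
    available here only because this is a simulation; in practice each
    examinee subgroup would answer just one subtest.
    Assumes n_subtests does not exceed the number of items or examinees.
    """
    rng = random.Random(seed)
    n_examinees, n_items = len(scores), len(scores[0])
    items = list(range(n_items))
    people = list(range(n_examinees))
    rng.shuffle(items)
    rng.shuffle(people)
    # Deal items and examinees round-robin into disjoint subsets.
    subtests = [items[j::n_subtests] for j in range(n_subtests)]
    groups = [people[j::n_subtests] for j in range(n_subtests)]
    estimates = []
    for subtest, group in zip(subtests, groups):
        # Mean subtest score within the subgroup that took this subtest.
        mean_sub = sum(scores[p][i] for p in group for i in subtest) / len(group)
        # Scale the subtest mean up to the length of the full test.
        estimates.append(mean_sub * n_items / len(subtest))
    # Pool the subtest-based estimates by simple averaging.
    return sum(estimates) / len(estimates)

# Randomly generated responses: 300 examinees, 40 items.
rng = random.Random(42)
full = [[1 if rng.random() < 0.6 else 0 for _ in range(40)] for _ in range(300)]
true_mean = sum(map(sum, full)) / len(full)
print("mean from complete data:", true_mean)
print("matrix-sampling estimate:", matrix_sample_mean(full, n_subtests=4))
```

Each simulated examinee contributes responses to only a quarter of the items, yet the group mean is still estimated for the whole test, which is the economy that makes the method attractive for assessing group performance.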

While we are generally very positive about the coverage of topics in 
the first part of the book, we did uncover a number of weaknesses that
we feel reduce the usefulness of the book. First, because the field of 
multiple matrix sampling is expanding at a tremendous rate, the book 
will be quickly dated. For example, important developments such as 
the use of multiple matrix sampling scores for examinee ability estima- 
tion (Bunda, 1973) and research on the problem of context effects 
(Feldt and Forsyth, 1974) were not available to the author. (In fact, 
recent research results on context effect are opposite to the results 
reported in the book and the tentative conclusions drawn by the 
author.) Thus, while the book serves as an excellent starting point for 
the study of multiple matrix sampling, an individual desiring an up-to- 
date comprehensive coverage of the field would now need to include 
new contributions to the field (e.g., Sirotnik, 1974) in his/her readings. 

Our second and most serious criticism concerns the chapter on 
guidelines. Since the book is primarily intended for practitioners, the 
guidelines should be clear and detailed; instead the guidelines are 
presented in a general form. Also, little in the way of a rationale for the 
guidelines is presented and there is no mention of practical procedural 
guidelines such as those concerning economic constraints, and strategies
for handling multiple item types and directions. One or more
carefully worked examples showing the steps in designing and implementing
a multiple matrix sampling study would have been informative.

The second part of the book includes chapter references, a very
[...] bibliography on multiple matrix sampling, and a 217 page
[...] on two computer programs designed for use in implementing
multiple matrix sampling studies. The first program has been
[...] to allow the user to estimate parameters in the multiple
matrix sampling model. The second allows the user to simulate various
multiple matrix sampling designs [...]

[...] technically correct, carefully edited, and highly suited for
self-study. When the reader recovers from the shock of discovering that
more than half the book consists of computer program listings, we
think he/she will find it to be an important and useful contribution to
the psychometric research field. Also, we feel sure that the book will
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alph W. Tyler and Richard M. Wolf (Eds.) Crucial Issues in Testing. 
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Any book that presumes to summarize issues in a diverse profes- 
ional field presents its authors with a choice between giving balanced 
atment (equal time?) to competing viewpoints or offering a consid- 
d judgment based on a review of the arguments or evidence, putting 
k leading to whatever evaluative con- 
ion has been reached. Tyler and Wolf have chosen the latter 
ernative. Even where articles by others have been used to reflect 
ferent views, a conclusion emerges. 

Two questions need answering in the review of such a publication as
this. First, did the authors choose the truly crucial issues? Second,
how well did they summarize and generalize the state of the art?

The seven-part table of contents answers the first question
affirmatively, covering (1) the testing of minority groups, (2)
selective testing in higher education, (3) testing for grouping
students for instruction, (4) criterion-referenced testing, (5)
assessing the educational achievement of schools or school systems,
“accountability,” (6) testing to evaluate effectiveness of programs,
methods, and materials of instruction, and (7) testing and the
invasion of privacy. Inevitably some of the topics overlap or involve
interaction, e.g., ability grouping and the testing of minority
groups, but the seven parts focus on definable areas of test use
requiring attention by builders and users of tests in education
without serious omission except possibly the use of tests in
evaluating non-traditional acquisition of certifiable mastery of
academic learning for high school or college credit.

Within the seven areas, then, the several sections may be evaluated
seriatim. The proper concern regarding the testing situation's impact
on minority individuals and their performance is clearly and
temperately articulated by Robert Williams for the Association of
Black Psychologists. It is treated comprehensively by Messick and
Anderson, who admit the fairness of many of the criticisms, but
conclude their rejection of the proposed moratorium on all
"psychological" testing by pointing to the greater unfairness to be
expected when stereotyped thinking in the dominant white group is not
tempered by the objectivity of test performance. Robert Thorndike’s
sophisticated analysis of test fairness to groups should be read here
if one has not encountered it previously. Use of a predictor variable
to produce no greater group discrimination in selection than is found
in criterion mastery after training is a powerful concept this
reviewer finds a step forward from previous simpler concepts of
fairness that ask only for comparable predictive validity.

The three excerpts from the 1970 report of the College Entrance
Examination Board's Commission on Tests are commendably
self-critical. One may wonder why something on service to public
colleges, more characteristic of the American College Testing
Program, was not included to show the problems of adaptation to the
needs of institutions that have been drawn sooner and farther into
open admissions. A useful historical function is performed by
reviewing the trend within this reviewer's life-time in high school
graduation and college attendance. Admission to "prestige" colleges
is still an active concern of [...] mobile middle-class families and
influences the public [...]

Edmund Gordon's brief for a more functional placement emphasis in the
achievement tests is especially wholesome if some may [...] of tests.
One may point to considerable placement [...] and present, and expect
supplementary information obtainable from and about applicants to be
used to the full. The College [...] and the Advanced Placement Tests
[...] moves in a constructive direction.

Tyler's [...] of the case against “ability grouping” is [...] recent
survey coauthored by this reviewer simply [...] the views attributed
to Heathers in greater detail. Homogeneity [...] achievement across
even the strictly academic curriculum is [...] such homogeneity [...]
allow a few leaders to run ahead, but only at the expense of [...]
the progress of stigmatized and understimulated ([...] intellectually
segregated) low track groups.


Again, testing is not the chief culprit, but the user of tests who,
if deprived of tests, will make even worse classification schemes out
of prejudice and/or stereotyping.

Airasian and Madaus may be credited with a forceful, albeit somewhat
polemical statement of the case for criterion-referenced testing. The
strengths and merits of measuring qualitatively what behaviors a
student can perform are clearly presented against a backdrop of the
admittedly unimaginative use often made of even the best standardized
tests of achievement. The villain is not the norm-referenced test per
se, but rigidities of school operation that persist in hurrying all
along at a uniform pace. Norm-referenced tests were welcomed and will
be used as means of describing what traditional school arrangements
engender. Criterion-referenced measurement promotes a better focus on
the teaching-learning process than group measures based on the
limited item samples of standardized achievement tests. But also
there remain subareas of achievement not prescribable for mastery
[...] by all. The active, well-honed, highly-motivated mind will gain
understandings beyond any specifiable set of outcomes in essentially
domain-referenced areas, to use Gronlund's term. Criterion-referenced
testing is a welcome addition to our armamentarium of measurement
that will increasingly serve to define meaningful, attainable goals
to the many we teach. It also has the virtue of goal-oriented
evaluation, freed of unnecessary comparisons and contrasts. But let
there continue to be recognized desirable levels of understanding,
[...] and even mastery not all will achieve, in addition to the
minimum essentials specified in operational terms measurable by
mastery [...]


Assessment approaches at national, state and local levels are
discussed in turn. Highly efficient summaries of purpose, design and
use are presented for the National Assessment of Educational Progress
(NAEP). This type of assessment, which this reviewer remembers then
Commissioner George Stoddard recommending to assembled schoolmen in
New York State as long ago as 1943, is already in its first [...]
yielding the benefits of triennial cycles of assessment of ten [...]
at four age levels with other well-conceived breakdowns, geographic
[...] is consequently more [...] for the purposes of this publication
[...] Rosenthal’s statement that [...] groups over time. Such
differences are being worked out, but [...] frustrations and
animosities [...] dissipate. An added note is that a recent Alabama
study has provided
a regression model for interpreting the achievement level of school 
systems relative to expectations based on economic indices, thereby 
allowing good teaching and learning in disadvantaged situations to 
show to advantage relative to expectations rather than unfavorably 
relative to national norms. 

Wolf's call for local programs responsive to immediate demands for 
accountability includes an incidental account of the history of achieve- 
ment testing since Rice's 1897 spelling studies that does justice to the 
major trends and counter trends. Perhaps the 1972 account this
reviewer remembers of a Dallas (Texas) criterion-referenced testing
program built and normed within the system augurs a reconciliation of
emphases now treated as antithetical. 

Who better than Tyler could review the history and outline the 
current situation regarding use of tests in curriculum evaluation? The 
call is for specificity in measurement directed to the essential specificity 
of learning, a far cry from the bold measures of general qualities like 
excessive caution and over-generalization in “interpretation of data”
tests. Students do develop abilities over time out of more specific 
directed learning. Generalizations are judicious syntheses of well- 
learned specifics; can they not do more than coexist peacefully, but 
build interactively on each other? 

Wolf's discussion of the invasion of privacy via testing has a balance 
that suggests the possibility of clearly established ground rules to 
protect privacy without hobbling use of testing for educational diag- 
nosis and research. It will take doing, but can be done. 

In sum, Crucial Issues in Testing raises most of the critical issues
to the point of visibility, is sometimes one-sidedly polemical, but
elsewhere points the way to future realization of important new
benefits or reconciliation of conflicting considerations. You'd
better read [...]